Will Rust Take over Data Engineering? 🦀
Interesting Open-Source Rust Projects
A language is only as good as its community. Let's look at some of the existing open-source tools and frameworks built in and around Rust:
- DataFusion: A SQL query engine built on Apache Arrow, similar to Spark
- Polars: A DataFrame library often described as a faster Pandas. It will probably compete with DuckDB (?)
- Delta Lake Rust: A native Rust library for Delta Lake, with bindings into Python and Ruby
- Cube: Headless BI for Building Data Applications
  - Written mostly in Rust, Cube’s data processing and storage are based on the Arrow DataFusion query execution framework, which uses Apache Arrow as its in-memory format. In particular, Cube Store, the cache layer at the core of Cube, is built 100% in Rust
- Vector.dev: A high-performance observability data pipeline for pulling system data (logs, metadata)
- ROAPI: Create full-fledged APIs for slowly moving datasets without writing a single line of code
- Meilisearch: Lightning Fast, Ultra Relevant, and Typo-Tolerant search engine
- Tantivy: A full-text search engine library
- PRQL: Pipelined Relational Query Language for transforming data
- Many more; please let me know of any
Less relevant to data engineering, but still cool:
- Deno: A fast JavaScript and TypeScript runtime, an alternative to Node.js
- Tauri: A framework for building tiny, blazingly fast binaries for all major desktop platforms
- Yew: A modern Rust framework for creating multi-threaded front-end web apps with WebAssembly.
Rust vs. Python
The downside of Rust is that its learning curve is much steeper than that of other languages, such as Python. That's why, for a long time to come, most Rust programs in data engineering will ship with a Python wrapper so they can be integrated into Python data pipelines. Moving to Rust is also a shift from an interpreted language such as Python to a more functional programming (FP) style, which Rust certainly supports.
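To make the functional style a bit more concrete, here is a minimal sketch using only Rust's standard library. The `Order` struct and the `total_for_country` function are hypothetical names made up for illustration; they do not come from any of the projects listed above.

```rust
/// A hypothetical record type, invented for this illustration.
#[derive(Debug, Clone, PartialEq)]
struct Order {
    country: String,
    amount_cents: u64,
}

/// Sums the amounts of all orders for one country.
/// The iterator chain (filter -> map -> sum) replaces a mutable loop
/// and leaves the input slice untouched.
fn total_for_country(orders: &[Order], country: &str) -> u64 {
    orders
        .iter()
        .filter(|order| order.country == country)
        .map(|order| order.amount_cents)
        .sum()
}
```

Calling `total_for_country(&orders, "CH")` returns the sum for that country without mutating `orders`, which is the kind of side-effect-free step that is easy to test and re-run.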
Other Recent Programming Languages
Many newer programming languages lean toward the functional programming approach. Examples include functional languages such as Scala (with Akka) and Elixir, and multi-paradigm languages such as Julia, Kotlin (one of the fastest-growing languages since Google made it the preferred language for Android development), and Rust.
Go seems to be a good compiled programming language and is widely used in DevOps.
Elixir has process supervision and retries built into the language, which suits monitoring data pipelines; no framework is needed. That makes it an excellent fit for data engineering, and it could replace parts of the data orchestrators.
Rust as a Primary Language?
Let's see an example of a modern data pipeline integrating with Airbyte, dbt, and some ML models in Python.
Each step can have errors and data mismatches. That's why we have orchestrator frameworks such as Dagster, which push you to write functional code, following the concept of Functional Data Engineering. Python itself is seeing wide adoption of type hints and of a more functional programming style. JavaScript offers a parallel example with the rise of TypeScript.
❓ The exciting question to me is whether Rust will be adopted as a primary language and whether it can do data orchestration work.
Within our data pipelines, we typically load data into a data frame and then transform it or add some business logic. This could be done efficiently with Rust, Apache Arrow, and DataFusion, which together offer type safety and a good ecosystem. Time will tell.
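As a rough sketch of what type safety buys in such a step, the example below parses raw rows into a typed record and applies a small piece of business logic. It deliberately uses only the Rust standard library, not the Arrow or DataFusion APIs, and the names `Event`, `RowError`, `parse_row`, and `revenue_per_user` as well as the two-column input format are assumptions made up for illustration.

```rust
use std::collections::BTreeMap;

/// A hypothetical typed record for one input row (illustration only).
#[derive(Debug, PartialEq)]
struct Event {
    user_id: u32,
    amount_cents: u64,
}

/// The data mismatches this sketch knows how to report.
#[derive(Debug, PartialEq)]
enum RowError {
    WrongColumnCount(usize),
    BadUserId(String),
    BadAmount(String),
}

/// Parses one "user_id,amount_cents" line into an `Event`.
/// A mismatch becomes an explicit error value instead of a silent bad row.
fn parse_row(line: &str) -> Result<Event, RowError> {
    let cols: Vec<&str> = line.split(',').map(str::trim).collect();
    if cols.len() != 2 {
        return Err(RowError::WrongColumnCount(cols.len()));
    }
    let user_id = cols[0]
        .parse::<u32>()
        .map_err(|_| RowError::BadUserId(cols[0].to_string()))?;
    let amount_cents = cols[1]
        .parse::<u64>()
        .map_err(|_| RowError::BadAmount(cols[1].to_string()))?;
    Ok(Event { user_id, amount_cents })
}

/// Business logic: total amount per user.
/// Stops at the first bad row and returns its error to the caller.
fn revenue_per_user(lines: &[&str]) -> Result<BTreeMap<u32, u64>, RowError> {
    let mut totals = BTreeMap::new();
    for line in lines {
        let event = parse_row(line)?;
        *totals.entry(event.user_id).or_insert(0) += event.amount_cents;
    }
    Ok(totals)
}
```

Because both functions return a `Result`, the caller has to decide what happens with a bad row (fail the step, skip it, or route it elsewhere), which is the kind of decision an orchestrator otherwise only discovers at runtime.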
Will Rust Be the Programming Language for Data Engineers?
Rust is a multi-use language and gets the job done for many problems of a data engineer. But the data engineering space is dominated by Python (and SQL) and will stay that way for the foreseeable future. There is no "until people fully move into Rust". It's hard to express how many tools and frameworks are written in Python to interoperate with other Python tools. It's pretty hard to imagine that inertia changing substantially in the next decade.
The Rust projects we have seen above are excellent and will continue to grow as vital, core components, but for them to be helpful for the average data engineer, they will need a Python wrapper. What was once supposed to be Scala will now be Rust: a backend tooling language for tasks that need fast and well-maintained code, with a Python wrapper on top.
Writing libraries in Rust feels more like writing long-term infrastructure than writing them in higher-level languages such as Python or Java and other JVM languages.
What do you think? What is your take on Rust for data engineers?
Resources to Learn More on the Topic
Suppose you want to be up and running within minutes: Karim Jedda has an article that carefully explores the Rust programming ecosystem from the perspective of a developer with 10+ years of Python experience, checking how to do everyday programming tasks and what the tooling looks like. Shared Services of Canada did a hands-on example with Rust, converting raw archive files into JSON for data analysis. There is also Mehdi Ouazza's article, where he debates the Battle for Data Engineer's Favorite Programming Language.
There are many excellent resources for learning Rust: A half-hour to learn Rust, The Rust Book, Rust By Example, Read Rust, or This Week In Rust.
Or Learning Rust with different kinds of formats:
Or do you want to get hands-on and search for an example project? How about building an Airbyte Delta Lake Destination (Python interface) with delta-rs?