Data News: Dagster 1.0 Launch Recap

This week (2022-08-09), at Dagster Day, Elementl announced General Availability (GA) of Dagster Cloud, and with it, Dagster 1.0. Dagster Cloud is now operationally robust, coherent, and easy to work with. It's advertised as an enterprise orchestration platform that puts developer experience first, with fully serverless or hybrid deployments, native branching, and out-of-the-box CI/CD.

It was not a typical launch with lots of features presented. Instead, the Elementl team showed off a rock-solid product four years in the making, shaped by lots of user feedback and now production-ready for the public. Nevertheless, there was still one big feature announcement: the brand-new Branch Deployments.

💡 The solidity of the product, and the thinking that went into it, seems proven by the community and user testimonials. Many users gushed about Dagster's developer experience in every corner of the product. You can tell the team and the tool started from a lack of good tooling and a drive to improve it.

Announcements

Nick Schrock, the CEO of Elementl, kicked off Dagster Day by recapping the past four years. He explained the challenges they see when every software company eventually becomes a data company: SaaS apps are operational apps that transform and handle data. Data tools are too fragmented. This is where Dagster can help most.

Sandy Ryza followed up with a presentation showing how Dagster differs from other data orchestration tools: it helps you at every stage (dev, staging, ..., prod) with its innovative concept of Software-Defined Assets. Software-Defined Assets can be seen as a layer between computation and data assets, and they lay the groundwork for declarative orchestration. Sandy pointed out that they are easier to use because no mental shift is needed to translate data flow into task-based operations.

He explained how Dagster, with its Software-Defined Assets, reduces the chaos when you start scaling to 100 or 1,000 jobs: you define each asset instead of arbitrary pipelines that exist only to produce those assets (less boilerplate, more directly defined). Each asset also knows its dependencies on other assets, so Dagster can take intelligent actions, such as refreshing only the data assets that need it, with much more to come.
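The declarative idea can be sketched in a few lines of plain Python. This is a toy illustration of the model, not Dagster's actual API (in real Dagster code you would use the `@asset` decorator); all names here are made up. Each asset declares its upstream dependencies, and the "orchestrator" derives the execution order from the graph instead of you wiring tasks by hand:

```python
# Toy sketch of the declarative asset model -- NOT Dagster's API.
# Each asset declares what it depends on; the orchestrator works out
# the execution order and materializes upstream assets first.

ASSETS = {}  # asset name -> (compute function, upstream dependency names)

def asset(*, deps=()):
    """Register a function as a named data asset with its dependencies."""
    def register(fn):
        ASSETS[fn.__name__] = (fn, tuple(deps))
        return fn
    return register

@asset()
def raw_orders():
    return [{"id": 1, "amount": 40}, {"id": 2, "amount": 60}]

@asset(deps=("raw_orders",))
def order_totals(raw_orders):
    return sum(row["amount"] for row in raw_orders)

def materialize(name, cache=None):
    """Recursively materialize an asset and its upstream dependencies."""
    cache = {} if cache is None else cache
    if name not in cache:
        fn, deps = ASSETS[name]
        cache[name] = fn(*(materialize(d, cache) for d in deps))
    return cache[name]

print(materialize("order_totals"))  # -> 100
```

Note that nothing here describes *how* to run a pipeline; asking for `order_totals` is enough, and the dependency graph does the rest. That inversion is the core of the declarative pitch.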

Owen Kephart showed off a demo using pandas, starting with some simple assets. He showed how you can separate the storage layer from the asset layer itself, and how the declarative model works even without the physical asset, so you never need to think about tasks at any stage. He then added a dbt project, pointing to a local dbt project for the transformation, and an arbitrary Python ML model. The Dagster UI, called Dagit, visualized the data lineage and code with different asset groups and a global asset lineage.
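The separation of storage from asset logic that Owen demonstrated can be sketched like this. This is a hypothetical illustration in the spirit of Dagster's IO managers, not the library's real interface; the class and function names are invented, and a real deployment would swap the in-memory store for a warehouse or object store:

```python
# Hypothetical sketch of separating the storage layer from asset logic.
# The asset's compute function never touches storage directly; a pluggable
# "IO manager" decides where outputs live (dict in dev, warehouse in prod).

class InMemoryIOManager:
    """Dev-time storage: keep asset outputs in a plain dict."""
    def __init__(self):
        self._store = {}

    def save(self, asset_name, value):
        self._store[asset_name] = value

    def load(self, asset_name):
        return self._store[asset_name]

def run_asset(io_manager, asset_name, compute_fn, upstream=()):
    """Load upstream values, run the pure compute, persist the result."""
    inputs = [io_manager.load(name) for name in upstream]
    io_manager.save(asset_name, compute_fn(*inputs))

io = InMemoryIOManager()
run_asset(io, "prices", lambda: [2.0, 4.0, 6.0])
run_asset(io, "avg_price", lambda prices: sum(prices) / len(prices),
          upstream=("prices",))
print(io.load("avg_price"))  # -> 4.0
```

Because the compute functions are pure, the same asset definitions can run against an in-memory store locally and a production store in the cloud just by swapping the IO manager object.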

Showcase Asset Lineage in Dagster

Dagster Cloud

Nick then presented the Dagster Cloud offering. There are two main deployment options:

  • Serverless: Effortless spin-up (currently short waitlist)
  • Hybrid: Bring your own compute (no waitlist)

Hybrid deployment gives users maximum flexibility and security: none of your code lives on Dagster's instances. The Serverless option means you can focus on your Python scripts while the rest is managed and handled by the cloud. With a Hybrid deployment, Elementl hosts all the stateful system infrastructure, and the only thing running in your cloud is the agent that handles communication.

For heavy scalability, you'd use Hybrid mode, where you can scale out your Kubernetes infrastructure as much as you want. The Serverless option typically outsources the heavy lifting by pushing compute down to technologies such as Apache Spark, Dask, and others.

Demonstrates the differences from dagster.io/cloud

Authentication is supported with Google, GitHub, and email, and enterprise plans can leverage SAML-based SSO (Okta, Active Directory, etc.). The development process at Elementl (the company behind Dagster) is now SOC 2 certified.

Dagster Cloud Pricing

Pricing is key to driving adoption; we know that at Airbyte very well, which is why we constantly iterate on ours.

The Dagster team wants to support everyone from small startups to big companies with the same pricing model, so scaling must be straightforward and predictable. Most other orchestration tools charge by task, which effectively prices by duration; that way, they say, you pay for growth.

They offer different plans, from Standard with one production deployment to Serverless and Hybrid with a flat fee per compute minute.

Overview of the different pricing plans (Dagster Cloud Pricing Explainer)

New Feature: Branch Deployments

Next up was the principal feature announced alongside Dagster Cloud: Branch Deployments. A Branch Deployment is a lightweight staging environment created with every pull request that becomes a focal point of development, testing, and collaboration. So how does it work?

Nick explaining Branch Deployments. Click for the Explainer Video

It deploys your branch, including your orchestration code and a copy of your production data, to Dagster Cloud. Afterward, you can access and run jobs on this clone of production. When you have completed your work in your PR and validated your changes in the branch deployment, you merge your changes back to the `main` branch; the new code is deployed into production, and your branch deployment goes dormant.

This way of working is compelling because it replaces the need to copy data locally and hand-build a test environment. Doing that any other way is hard: mocking clusters on a local machine, or setting up a staging or test environment, is tough (though easier with Kubernetes) and even more challenging if you depend on SaaS services.

📝 Copying data only works if your underlying technology supports it. Snowflake, for example, has a cloning feature that lets Dagster clone the relevant data into the new branch deployment in the cloud. Another tool that supports this for data lakes is LakeFS. The more users adopt branch deployments, the more they will likely request support for other storage technologies.

Stability of APIs and Software-Defined Assets

They stressed how positive the feedback was on the core abstraction changes, including the newer Software-Defined Assets, and that they are doubling down on them by adding many features in the future without breaking the existing API.

If you haven't heard of Software-Defined Assets before, they are the glue between ops, jobs, and assets, and the enabler of declarative orchestration. Existing abstractions such as `ops`, `graphs`, `jobs`, `schedules`, and `sensors` will not go away; they are the foundation of Dagster. But Nick anticipates that more code will be written with software-defined assets in the future, as they reduce the need for boilerplate. I recently explained this shift from an imperative to a declarative approach with data assets in Data Orchestration Trends: The Shift From Data Pipelines to Data Products.

Wrapping Up

The overall Dagster Day took one hour and can be re-watched here. Dagster also posted articles announcing everything in written form.

Relevant links about the Dagster 1.0 launch:

I hope you liked this product news heads-up. If you have comments or critiques, join our Slack Community to network with 6,000+ data engineers. If you want to stay up to date with more data insights, sign up for the Newsletter and we will keep you posted on upcoming articles.

Open-source data integration

Get all your ELT data pipelines running in minutes with Airbyte.