Ship data pipelines with extraordinary velocity using Dagster.
Dagster helps data engineers tame complexity. Elevate your data pipelines with software-defined assets, first-class testing, and deep integration with the modern data stack.
Dagster is a cloud-native open-source orchestrator for the whole development lifecycle, with integrated lineage and observability, a declarative programming model, and best-in-class testability.
Came here excited to learn about new features in the latest Dagster version. But it looks like you've decided to widen the feature-gap between the open source offering and the enterprise offering... even though this will be a maintenance burden on your team... causing delays in "backporting" features and bugfixes to the open-source version going forward. Kinda disappointed...
Thanks for the comment @JohnCF. If you go through the enhancements introduced with this Dagster+ launch, you will see that many of them (in fact, all of them except for Dagster Insights) benefit both the open-source and the commercial offerings. The data cataloging capability is a good example of that. From our perspective, these new additions are moving us forward on both the OSS and the Dagster+ roadmaps. In addition, by providing more value to those organizations that adopt Dagster+ we are able to guarantee the longevity and accelerated development of Dagster Open-Source.
@@dagsterio Does that mean what's mentioned at 7:15 about column lineage is available in open-source too? The phrasing definitely sounded like it's only available for Enterprise users...
Is there native support for mapping time-based partitions to static partitions defined like "today", "rest of month", "rest of year", "rest of history"? This is a common setup for Power BI datasets, which can be represented as assets in Dagster. It would be nice to take advantage of auto-materialize policies.
Dagster does not natively support mapping time-based partitions to static partitions like "today," "rest of month," "rest of year," and "rest of history." However, you can achieve similar functionality by defining custom partitioning schemes and using the appropriate partition mappings: StaticPartitionsDefinition for static partitions and TimeWindowPartitionsDefinition for time-based ones.
If the requirement is to get the data from S3 files into a BQ table but perform some validations on those files before inserting into the table, how would we do it with Embedded ELT? We are using Dagster OSS heavily and looking to use embedded-elt for getting data from files, tables, and APIs.
Hey Abhishek! In your case, would you be able to represent the S3 files as source assets first, add asset checks onto those, and run Embedded ELT only if those asset checks pass? Sling currently (afaik) is heavily focused on doing ingestion well, so you can defer to the rest of the Dagster ecosystem (such as asset checks) for validations.
@@AbhishekAgrawal-dv1id we've found that dlt is a powerful framework for ingesting from APIs, and it's definitely mature enough for production settings. I'll also say that neither Sling's nor dlt's integration currently allows for creating asset checks in-flight during ingestion. Instead, have you thought about ingesting the files into a quarantined dataset first using whichever tool you'd like, applying asset checks to that, and then moving that data to your real "analytics-ready" BQ datasets once you've vetted it? This way, you can easily do ad hoc analysis to understand why the data failed quality tests, while keeping it isolated from your production analytics.
Yeah, I am also leaning towards doing something like this. Thanks for this, Tim. Would you suggest using a similar approach to pull data from a different database? We'd still need to run minor validations on the incoming data, though. Would dlt help here at all?
As a person with just 2 years of experience, my mind was blown watching this. I am the only person writing code in my department, so I don't have any seniors to learn from, but I'm leading a data engineering project that deals with terabytes of data: each request is multiple times larger than the server's RAM, and multiple such requests need to be processed in parallel to finish in time. We also have the tiniest possible budget to aggregate 25 to 30 columns and billions of rows every day, and we need to cut down on costs. This was super helpful.
For some teams, definitely, although it can be complementary to dbt docs, because it pulls in some of that data via the dbt integration. It essentially becomes a superset of the documentation.
Many of the enhancements in the 1.7 release benefit all users (Open-source and paid Dagster+ users). In general, the open-source solution gains more capabilities with each release both to support open-source users and to unlock more capabilities in Dagster+ which are built on top of core.
Hi! I had to put it in a different repo to accommodate running multiple code locations without breaking our existing setup for the deep dive projects. The dedicated repo for the data mesh example can be found here! github.com/dagster-io/data-mesh-demo
This is the coolest tech demo I've ever seen. I have wanted for so long to see an end-to-end analytics stack demo, or tutorial, and never found it. You just did it in 15 minutes, using free, open source tools I can run locally on my laptop. Absolutely incredible!
You might find this blog by Sandy interesting: dagster.io/blog/dagster-ml-pipelines. Otherwise, you can listen to the entire podcast featuring Sandy here: datastackshow.com/podcast/machine-learning-pipelines-are-still-data-pipelines-with-sandy-ryza-of-dagster/
I work in a financial institution and there is definitely a need for a reliable and resilient data process. Look forward to finding out more about Dagster. I also agree, no point building something flaky and have it barf 🤢
More specifically for this session: github.com/dagster-io/devrel-project-demos/tree/main/dagster-deep-dives/dagster_deep_dives/resources_and_configurations
I don’t know… this video is one year old, but still uses the legacy DAG syntax from Airflow 1, rather than the TaskFlow API from Airflow 2. So the syntax doesn’t make a difference anymore. Regarding the coupling to environment: Airflow has different executors. The KubernetesPodOperator is not the only way to run on a Kubernetes environment. The rest may or may not be true. Probably there are many things that Dagster does better than Airflow. But I’m disappointed that you would publish such a biased comparison.
All the code for the demos from the deep dives are in this repository ( github.com/dagster-io/devrel-project-demos )! This one in particular is in the partitions directory.
Joining other comments, I'd love to see more step-by-step tutorials and use cases. It took a few videos to grasp the concepts, and this one is a good one to start with. Docs are good, but videos are even better. I would love to see more DuckDB / Dagster and ingestion cases.
Hi @user-hs9lo5gh3r, the most common way to bring up this menu is to select an asset from the global asset lineage, and then in the top right where it says "Materialize selected...", open the dropdown menu and select "Open launchpad". Hope this helps!
I really want to love Dagster, but watching this video reminded me of why I stopped using Dagster for moving data from point A to point B. There are so, so many layers of configuration and plain infrastructure all over the place that kind of just need to be there, and the actual business logic (you know, the valuable part of the code that defines the data product) gets completely buried.
IMO, one of the most confusing and unnecessarily convoluted concepts in Dagster (which is otherwise amazing). E.g., what's with RunConfig having references to `ops` when things then have to be keyed/named by asset name? You totally glossed over the global config item (e.g. an S3 bucket that is common to everyone); then you have to use an awkward resource that doesn't really do anything other than hold some fields (ahem, config). I really wish this would get cleaned up.
Hey @Amapramaadhy, what you’re expressing is totally valid. The concepts of Assets, Ops, and Jobs and how to compose them can be a bit convoluted - this has become more noticeable as our APIs evolve. We’re aware of this, and it’s on our roadmap to improve. Thanks for taking the time to respond and sharing your thoughts.
No doubt that every new powerful framework takes some investment up front to learn. Have you explored Dagster University? courses.dagster.io/courses/dagster-essentials
In terms of debugging, being able to run Dagster in debug mode in VS Code, set breakpoints, and inspect variables is a game changer. Here is how to set it up: github.com/dagster-io/dagster/issues/17859#issuecomment-1805916514
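Along the lines of the linked issue, a `launch.json` configuration like the one below should get breakpoints working; this is a sketch, and `my_definitions.py` is a placeholder for your own definitions file:

```json
{
  "version": "0.2.0",
  "configurations": [
    {
      "name": "Debug dagster dev",
      "type": "debugpy",
      "request": "launch",
      "module": "dagster",
      "args": ["dev", "-f", "my_definitions.py"],
      "justMyCode": false
    }
  ]
}
```

With `justMyCode` set to `false`, you can also step into Dagster's own framework code, not just your assets.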
Awesome 👏🏽. Really nice and succinct description of an otherwise tricky feature. Hopefully a future video can cover advanced use cases of how to wire up sensors with partition definitions so that we can programmatically launch/backfill etc. Thanks again for the great content.