1) If you meant defining the .devcontainer.json, VS Code provides templates out of the box. From the command palette (Cmd+Shift+P on Mac), select Reopen in Container > Add configuration..., and you will be prompted with different templates (Python, Node.js, etc.) that you can easily customize. 2) If you meant just running these, you need Docker Desktop installed; then open the command palette and select "Reopen in Container". This will detect the current .devcontainer.json configuration and open the project inside the devcontainer.
@motherduckdb Thanks. I also had another question: why was there a need for the models.py file and pydantic, and on top of that, tests for it? I understand it is used to validate data types, but could you please provide a more detailed explanation? Apologies if it's a noob question. Also, I really liked your approach of using a Makefile and tests; most other tutorials don't do this. This is the best tutorial I have encountered up until now. Please make more tutorials, as there isn't a good data engineering course online. A course or more tutorials would also be great.
Thanks for your kind words! We'll do our best! To answer your question: you want to validate that you've defined the right "model" for your data, hence the tests against the models defined in models.py. These tests check that the model definition is correct and that it throws an error when the data doesn't match what we expect.
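For example, a minimal sketch (the Trip model and its field names are placeholders, not the ones from the video): a pydantic model plus two tests, one checking that a valid record parses and one checking that a bad record raises a ValidationError.

```python
# Hypothetical example: a pydantic model standing in for what models.py defines,
# plus tests that the model accepts good records and rejects bad ones.
import pytest
from pydantic import BaseModel, ValidationError


class Trip(BaseModel):
    trip_id: int
    distance_km: float


def test_valid_record_parses():
    # pydantic coerces the numeric string "3.2" into a float
    trip = Trip(trip_id=1, distance_km="3.2")
    assert trip.distance_km == 3.2


def test_invalid_record_raises():
    # a record that doesn't match the model should fail loudly
    with pytest.raises(ValidationError):
        Trip(trip_id="not-a-number", distance_km=3.2)
```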
Thanks for another great practical video. I'm definitely interested in learning about orchestration and data quality practices that are as accessible as the previous videos.
I am just waiting for native support for read/write on Hudi/Iceberg/Delta (any of them, really). I cannot wait to replace Spark whenever that happens. (There is the delta-rs project, but I can't use it yet with a lot of our existing Delta tables.)
Yes! You can use a post-hook strategy and macros to have multiple outputs. Essentially, you'd have a macro that does a COPY to a given location (see the sketch below). Hope that helps!
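A rough sketch of that idea with dbt-duckdb (the macro name and output path are just placeholders, not from the video):

```sql
-- macros/export_to_parquet.sql (hypothetical name): COPY the model that was
-- just built to an external file, giving the model a second output location.
{% macro export_to_parquet(path) %}
    COPY (SELECT * FROM {{ this }}) TO '{{ path }}' (FORMAT PARQUET)
{% endmacro %}

-- then register it as a post-hook in the model file:
-- {{ config(post_hook="{{ export_to_parquet('output/my_model.parquet') }}") }}
```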
Does anyone know how to connect to SQL Server (via SSMS) using a linked server? Is it even possible? I'm trying to use DuckDB as an in-between for CSV files and then move them to SQL Server.
Do we still need to install Java and Hadoop and modify the environment variables on the local machine to do this, or can we just install DuckDB, pip install pyspark, and start using SparkSession and SparkContext? Thank you.
It's an API translation, meaning you can write Spark code but have the execution done by DuckDB if you want. In that case, no PySpark/Java/Hadoop is needed (see the sketch below). Hope that clarifies!
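A minimal sketch using DuckDB's experimental Spark API (it's experimental, so module paths may change between versions):

```python
# Spark-style DataFrame code, executed by DuckDB: no Java, Hadoop, or Spark cluster.
import pandas as pd
from duckdb.experimental.spark.sql import SparkSession
from duckdb.experimental.spark.sql.functions import col, lit

spark = SparkSession.builder.getOrCreate()

# build a DataFrame from pandas and use the familiar Spark DataFrame API
pandas_df = pd.DataFrame({"name": ["a", "b"], "amount": [10.0, 25.5]})
df = spark.createDataFrame(pandas_df).withColumn("source", lit("csv"))
print(df.select(col("name"), col("amount"), col("source")).collect())
```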
Yes, pip install works great. Embedded means that it's a library that runs inside your application. It works in-memory but can also use files on disk and perform larger-than-memory operations; quick example below.
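To illustrate what "embedded" means in practice (the database file name is arbitrary):

```python
import duckdb

# in-memory: the database lives entirely inside your application's process
con = duckdb.connect()  # equivalent to duckdb.connect(":memory:")
con.execute("CREATE TABLE t AS SELECT range AS i FROM range(1000)")
print(con.execute("SELECT sum(i) FROM t").fetchone())

# file-backed: the same library, but data persists to a local file and
# DuckDB can spill to disk for larger-than-memory workloads
con_disk = duckdb.connect("local.duckdb")
```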
I guess I'm stating the obvious, but for anyone who doesn't use SQL for data operations, DuckDB is second-class. And I certainly don't like using SQL for transformations and such.
I did some research and found that dbt's defer feature does not really depend on the environment/adapter, but on the two manifests it compares to compile the queries. So I see no reason why MotherDuck wouldn't play nicely with it.
Really interesting! Could you please add the links for the recommendations and for the tutorials Matt mentioned about very large data? That would be quite helpful for digging deeper into this topic. Thanks!!
Hey Mehdi, I've been enjoying these Quack and Code sessions, keep up the great content! I would suggest a different layout or setup for when you are showcasing code, because the screen is at a very low resolution.
Thank you so much for a great tutorial series. I'm looking forward to Part 3 - Dashboarding. I am also looking forward to a video on Python Runtime tools and data orchestration. Thank you again!