The Unified Lakehouse Platform for Self-Service Analytics
Bring users closer to the data with lakehouse flexibility, scalability, and performance at a fraction of the cost. Dremio's intuitive Unified Analytics, high-performance SQL Query Engine, and Lakehouse Management service for next-gen DataOps let you shift left for the fastest time to insight.
Hi, thanks — this is really great information to start with Apache Iceberg. But I have a question: when modern databases already use such advanced technology to prune and scan data, why would we need to store the data in file formats instead of loading it directly into a table?
When you start talking about 10TB+ datasets you run into issues over whether a database can hold the dataset at all, let alone performantly. Also, different purposes need different tools, so you need your data in a form that can be used by different teams with different tools.
Also, with data lakehouse tables there doesn't have to be any running database server when no one is querying the dataset, since the tables are just files in storage, while traditional database tables need a persistently running environment.
Thanks for the great video. Question: when we first run the DELETE command in the lesson2 branch, does the data also appear in MinIO? That is, does MinIO object storage show both the lesson2 branch and the main branch separately? I'm curious because in MinIO there are only data and metadata prefixes, and there is no directory for main vs. lesson2.
I think I got it now. The storage layer does not have any concept of branches, so the warehouse/data/ directory stores Parquet files for both the lesson2 branch and the main branch. I can tell because there are files with different timestamps matching my SQL operations in each branch.
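That's exactly right, and you can see it from Spark SQL. A minimal sketch, assuming the Nessie catalog is named nessie and the table is named names (both assumptions from the tutorial setup):

```sql
-- Branches are pointers in Nessie's metadata, not directories in storage.
-- Switching references changes which table snapshot is read; all data
-- files share the same warehouse/data/ prefix in object storage.
USE REFERENCE lesson2 IN nessie;
SELECT COUNT(*) FROM nessie.names;  -- reads the lesson2 snapshot

USE REFERENCE main IN nessie;
SELECT COUNT(*) FROM nessie.names;  -- reads the main snapshot, same storage prefix
```

Both queries scan files under the same prefix; only the metadata file each branch points to differs.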
Thank you so much! I have a question. I'm wondering if there is any way to run these procedures automatically in Iceberg, or do I have to do them by hand every time?
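Table maintenance can be automated by scheduling Iceberg's built-in Spark procedures with a tool like Airflow or cron. A hedged sketch, assuming a catalog named nessie and a table db.names (both assumptions):

```sql
-- Compact small data files into larger ones.
CALL nessie.system.rewrite_data_files(table => 'db.names');

-- Expire snapshots older than a cutoff so their files can be cleaned up.
CALL nessie.system.expire_snapshots(
  table => 'db.names',
  older_than => TIMESTAMP '2024-01-01 00:00:00'
);
```

Submitting these statements on a schedule removes the need to run maintenance "in person" each time.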
Hi Alex, really thankful to you for such a nice explanation and hands-on. I got stuck at 'CREATE BRANCH IF NOT EXISTS lesson2 IN nessie'. This keeps failing with the error message "syntax error at or near 'BRANCH'". Am I missing something? Kindly assist.
If you want, PM me (Alex Merced) your Spark configs. Usually it's a typo or an update that needs to be made to the Spark configs. Spark can be very touchy on the config side, which is one reason using Dremio for a lot of Iceberg operations is so nice (much easier).
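That particular "syntax error at or near 'BRANCH'" usually means the Nessie SQL extensions were not loaded — spark.sql.extensions needs to include org.projectnessie.spark.extensions.NessieSparkSessionExtensions (alongside the Iceberg extensions) in the session config. Once they are loaded, the branch syntax from the lesson parses:

```sql
-- These statements are Nessie SQL extensions; they only parse when
-- NessieSparkSessionExtensions is present in spark.sql.extensions.
CREATE BRANCH IF NOT EXISTS lesson2 IN nessie;
USE REFERENCE lesson2 IN nessie;
```

If the extensions are missing, Spark's default parser rejects BRANCH as an unknown keyword, which matches the error shown.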
Awesome video!! At 3:18, when explaining the different delete formats, I have a question about the implementation: since the delete mode only accepts MOR or COW, how exactly do I specify whether the delete operation uses equality deletes or positional deletes?
It's mainly based on the engine. Most engines will use position deletes, but streaming platforms like Flink will use equality deletes to keep write latency to a minimum.
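To make the COW vs. MOR part concrete: that choice is set per table via Iceberg table properties, while the position-vs-equality choice within MOR is left to the writing engine. A sketch, assuming a table named nessie.names (an assumption):

```sql
-- Opt this table into merge-on-read; engines that honor these properties
-- will then write delete files (position or equality, engine's choice)
-- instead of rewriting whole data files (copy-on-write).
ALTER TABLE nessie.names SET TBLPROPERTIES (
  'write.delete.mode' = 'merge-on-read',
  'write.update.mode' = 'merge-on-read',
  'write.merge.mode'  = 'merge-on-read'
);
```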
Great article Alex. Slight issue creating a view in Dremio, I get the following exception "Validation of view sql failed. Version context for table nessie.names must be specified using AT SQL syntax". Nothing obvious in the console output, any ideas?
@@AlexMercedCoder Thanks Alex. This would seem to be a limitation of the 'Save as View' dialogue, as it doesn't allow me to do this and it doesn't default to the branch you're in the context of currently.
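For anyone hitting the same "Version context for table ... must be specified using AT SQL syntax" error: writing the view by hand with the SQL editor works, because you can pin the source table to a branch explicitly. A sketch, assuming the table is nessie.names and the view lives at a hypothetical path nessie.views.names_view:

```sql
-- Pinning the table to a branch with AT BRANCH satisfies the
-- version-context requirement that the Save-as-View dialog omits.
CREATE VIEW nessie.views.names_view AS
SELECT * FROM nessie.names AT BRANCH main;
```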
If you're following this tutorial: sometimes Spark has weird DNS issues with the Docker network, and the fix is to use a container's IP address instead of its hostname. If you run into an "Unknown Host" error using minio:9000, the Docker network's DNS may not be resolving the name minio to the container's IP address. In that case, replace minio with the container's IP address. You can look it up with docker inspect minio (or by inspecting the network in the Docker Desktop UI) — find the IP address in the network section and update the STORAGE_URI variable, for example STORAGE_URI = "172.18.0.6:9000". The same fix applies if Spark can't resolve the Nessie container's hostname.
How come Iceberg can read a CSV file? I thought you could only use Parquet, ORC, or Avro. Is it just a Dremio vendor thing? Because in Trino you can only use Parquet, ORC, or Avro.
The CSV file is not part of the Iceberg table. In this example we are taking a CSV file and adding its contents to an Iceberg table, but new Parquet files are being written and a new metadata snapshot is being created.
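A hedged sketch of what that looks like in Dremio SQL, assuming the CSV has been promoted as a dataset in a source named source (names are assumptions):

```sql
-- CTAS reads the promoted CSV dataset and writes a brand-new Iceberg
-- table: new Parquet data files plus a new metadata snapshot. The CSV
-- itself is only the input; it never becomes part of the table.
CREATE TABLE nessie.names AS
SELECT * FROM source."names.csv";
```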
Hi, can you send me the query for the UPDATE command? I am getting an error about it — is it true that we cannot use the UPDATE command? Or is there another command we can use?
Dremio is particularly designed for structured and semi-structured data, although in the future different AI tools may help turn unstructured data into structured data for analytics.
Hi, thanks for the video! I work in a small company with 10 dashboards. Can I use Dremio as a centralized way to quickly access data without using dbt (I don't know dbt)? Dremio as a fast data lake where I can use SQL on my Parquet files and various databases to create my dashboards? Again, thanks for the video.
Yes, dbt was just demonstrated to show the integration, but it isn't required for using Dremio. You can do everything via the UI. Here is an exercise to show you just that -> bit.ly/am-sqlserver-dashboard
In this video we are not using AWS Glue Studio but a Docker container with a notebook server, and I am configuring the environment variables in the docker run command. AWS Glue is just being configured as the catalog in the Spark session. In AWS Glue you should be able to specify env variables on the job settings page. Find examples here: github.com/developer-advocacy-dremio/quick-guides-from-dremio
Can you filter categorical variables via the column heading, like Excel dropdowns? If not, is that coming? You can in Power BI, and if you aim the product at those users, they may not want to write SQL for every filter action.
@@Dremio Thank you but could not find text to SQL on the local dremio client, is it exclusive to cloud? I feel like dropdown headers could easily generate SQL.
@@emonymph6911 Yes, text-to-SQL is a cloud-exclusive feature. Both cloud and software have no-code features that can be accessed by clicking on a column to generate calculated columns, joins, data type changes, and more.
@@Dremio Thank you. My only feedback is that Excel-style filters on unique values in your column headers would be really convenient and nice to see in a future release. Apart from this I think the software is amazing — lots of respect for the team.
Hey Dremio team!! How can we programmatically ingest data into an Iceberg table built using CTAS in Dremio? If I have already built an Iceberg table in Dremio, and now on a schedule or event I want to append rows from a file into this table using some program and a scheduling tool like Airflow, how is that achievable? Most of your demos show DML operations from the SQL editor, but that's not the production way to go.
SQL is fine — you can use Airflow to send SQL to Dremio to insert records into the desired table. In this tutorial I give an example of doing an append-only insert: bit.ly/am-sqlserver-dashboard
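A minimal sketch of the kind of statement Airflow could submit to Dremio on a schedule; the table names (nessie.sales, staging.sales_new) and the id key are hypothetical:

```sql
-- Append-only load: insert only rows whose id is not already present,
-- so re-running the job doesn't duplicate data.
INSERT INTO nessie.sales
SELECT *
FROM staging.sales_new s
WHERE s.id NOT IN (SELECT id FROM nessie.sales);
```

Airflow just needs a connection that can execute SQL against Dremio (e.g., via Arrow Flight or JDBC) and a task that runs this statement on the desired schedule.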
Hey Alex! Nice video! Today I use Apache NiFi to retrieve data from APIs and DBs, and MariaDB is my main DW. I've been testing Dremio/Nessie/MinIO using docker-compose, and I still have doubts about the best way to ingest data into Dremio. There are databases and APIs that cannot be connected to it directly. I tested sending Parquet files directly to the storage, but the upsert/merge is very complicated, and the JDBC connection with NiFi didn't help me either. What would you recommend for these cases?
@@Dremio It's exactly that article which made me ask the question. ^.^ Don't get me wrong — I'm trying Dremio right now in local Docker and it looks amazing. But I still thought Hudi with its timeline is more suitable for BI, considering dates tie in well with graphs, event streams, and the Data Vault methodology as well. Going to watch the XTable presentation at Subsurface, looking forward to it! PS: Alex, your customer care videos and docs are the best in the world for a software application. I like how you go at a moderate pace and cover terminology in the tutorial before showing the ropes. It makes for a low barrier to entry. Please keep that up. 10/10!
@@emonymph6911 I think this may be answering the opposite question, but this article may be helpful too: www.dremio.com/blog/dremios-commitment-to-being-the-ideal-platform-for-apache-iceberg-data-lakehouses/ I do think there is a tremendous benefit to the reusability of Iceberg's metadata structure, along with its partition evolution and hidden partitioning features, which are unique to the format.
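To make those two Iceberg features concrete, a sketch in Spark SQL; the catalog, table, and column names are hypothetical:

```sql
-- Hidden partitioning: the table is partitioned by a transform of
-- event_ts, but queries just filter on event_ts directly -- readers
-- never need to know (or name) the partition layout.
CREATE TABLE nessie.events (id BIGINT, event_ts TIMESTAMP)
PARTITIONED BY (months(event_ts));

-- Partition evolution: change the granularity later without rewriting
-- existing data files; new writes use the new spec.
ALTER TABLE nessie.events ADD PARTITION FIELD days(event_ts);
ALTER TABLE nessie.events DROP PARTITION FIELD months(event_ts);
```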
What a great tutorial. One thing that I didn't get is how you converted the JSON string object constructed by Airbyte into columns with their values. Thanks in advance.
In this blog you can see the SQL in more detail, but essentially I turn the JSON string into an object and access the properties via their keys. www.dremio.com/blog/how-to-create-a-lakehouse-with-airbyte-s3-apache-iceberg-and-dremio/
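The shape of that SQL in Dremio, sketched with hypothetical column and table names (json_column, airbyte_raw_table, and the keys name/age are assumptions — the blog has the exact query):

```sql
-- CONVERT_FROM parses the raw JSON string into a struct; the outer
-- query then reaches into the struct's keys to produce real columns.
SELECT parsed."name" AS name, parsed."age" AS age
FROM (
  SELECT CONVERT_FROM(json_column, 'JSON') AS parsed
  FROM airbyte_raw_table
) t;
```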
Yeah, thanks @@Dremio. I would like to ask a question: what if we want to use Project Nessie as a catalog for Iceberg tables directly — is there any option for this?
Hi. I have been testing with Dremio OSS version 24.2.6. I have been looking into Dremio to find a solution for providing roles and privileges; however, it's not available anywhere. Going through the documentation on Dremio's website, it mentions this feature is available in Dremio v16.0+ Enterprise Edition only. My Dremio runs in a Docker container on a single server along with Nessie, Postgres, and Spark. In your video, I can see you are also using localhost. How did you manage to have privileges and access control? Is there any way I can do the same with the open-source version? Is there any roadmap to include it in the OSS version?
How do we resolve merge conflicts? For example: the main branch moved ahead and added/deleted some data, the temp branch has some changes, and I'm trying to merge temp into main. How does Nessie handle this case? Do we need to manually resolve the merge conflict?
Nessie has the ability in its REST API to force merges or ignore certain objects. These features should be coming very soon to its SQL support. In future iterations it will become more context-aware, so it can auto-reconcile such conflicts further down the road.
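The happy path today is the merge itself via SQL; a sketch, assuming branch and catalog names from the tutorial (lesson2, nessie):

```sql
-- Attempt the merge; if main has changed the same tables since the
-- branch point, Nessie rejects it and you either re-apply the branch's
-- changes on a fresh branch or use the REST API's force/ignore options.
MERGE BRANCH lesson2 INTO main IN nessie;
```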
Hey Alex, I'm also getting this error:

11:57:41  1 of 2 START sql table model warehouse-dbt.test2.my_first_dbt_model ............ [RUN]
11:57:41  1 of 2 ERROR creating sql table model warehouse-dbt.test2.my_first_dbt_model ... [ERROR in 4.22s]
11:57:41  2 of 2 SKIP relation test2.my_second_dbt_model ................................. [SKIP]
11:57:42  Finished running 1 table model, 1 view model in 0 hours 0 minutes and 11.60 seconds (11.60s).
11:57:42  Completed with 1 error and 0 warnings:
11:57:42  Runtime Error in model my_first_dbt_model (models/example/my_first_dbt_model.sql)
11:57:42    ERROR: Validation of view sql failed. No match found for function signature my_first_dbt_model(type => <CHARACTER>)

How do I solve this error? Any help will be appreciated. Thanks.
From the video, what I understand is that Nessie is the best catalog for a data lakehouse. It is easy to manage and goes beyond a typical catalog's capabilities by providing a Git-like environment.
Hi there, thanks for the awesome video! Any reason why s3.endpoint was set to an IP address rather than a host name when creating the catalog? I found the hostname style also works in the demo with s3.path-style-access=true.
This worked fine. The only detail I had to look at twice was using the project ID, which is a long GUID, instead of the name of the project — the dbt-dremio plugin does not return meaningful errors if there are any problems. Another UI thing that was odd to figure out is that you have to click on an Arctic catalog and then find the tiny button in the upper right of the screen to create a folder; no actions are possible from the folder tree on the left for Arctic-related functions. Those things made this less straightforward than it could have been, but it still ended up working great.
Hey Alex, nice video. I have a question for you: when converting any file (CSV, JSON, or Parquet) from a data lake to an Iceberg table, will the data be duplicated — will there be a copy in the Iceberg table?
If scanning the QR codes on the phone is not working, both videos should also be searchable on the Dremio YouTube channel. I think the next video in the first QR code is ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-bvXj4ANMy10.htmlsi=KrthtZQr_Dve9Ter
I'd have to see the whole log output and catalog settings to determine the issue. If you want, message me on LinkedIn and I can examine it further. - Alex Merced
Hey Alex, really appreciate your work. I am quite a beginner here and I have a basic question: why didn't you include dbt-dremio in your Docker containers? Why did you configure it separately in a virtual Python environment? I would really appreciate the clarification.
1. dbt-dremio needs to be installed in the context where your dbt models exist, which is usually not on the same system Dremio is running on (you wouldn't want both processes fighting over resources). 2. The virtual environment isolates dependencies from other projects, like web apps, so I can more easily make the environment replicable.