Process Excel files in Azure with Data Factory and Databricks | Tutorial

Excel is one of the most commonly used file formats on the market. The popularity of the tool among business users, business analysts and data engineers is driven by its flexibility, ease of use, powerful integration features and low price.
This is why every data engineer should understand the advantages and disadvantages of this format, the variety of internal formats such as XLS, XLSX, XLSB and XLSM, and which tools to use to process those files effectively in the cloud.
Today I bring you a quick introduction to building ETL solutions with Excel files in Azure using the Data Factory and Databricks services.
Code samples: github.com/MarczakIO/azure4ev...
Agenda
00:00 Introduction
00:25 Excel Business Justification
01:22 Excel Challenges
02:20 Supported Services
04:30 Data Factory Introduction
05:35 Demo Setup
07:13 Demo using Data Factory
13:36 Databricks Introduction
14:44 Databricks Setup
18:14 Databricks Demo - Reading Excels
20:55 Databricks Demo - Reading Excels using References
25:56 Databricks Demo - Workbook Metadata
28:05 Databricks Demo - Defining Schema
30:03 Databricks Demo - Defining Schema
32:53 Additional Options
Next steps for you after watching the video
1. Excel format in Data Factory
- docs.microsoft.com/en-us/azur...
2. Spark Excel by Crealytics documentation
- github.com/crealytics/spark-e...
Want to connect?
- Blog marczak.io/
- Twitter / marczakio
- Facebook / marczakio
- LinkedIn / adam-marczak
- Site azure4everyone.com

Published: 4 Aug 2024

Comments: 192

@AdamMarczakYT 4 years ago

By force of habit, I keep saying Crealytics library, but in fact, this library is called Spark-Excel and was developed by the Crealytics company. 😊

@satyajee9575 4 years ago

Great videos Adam 👍🏻

@AdamMarczakYT 4 years ago

@@satyajee9575 Thank you :)

@rengaray1 4 years ago

@@AdamMarczakYT Awesome as always. Thanks

@arpitbest 4 years ago

Great man ..

@santoshatyam1409 3 years ago

Hi, while uploading an Excel file and viewing the data, it shows as invalid, but the extension is correct. Please help me.

@lonaosmani991 2 years ago

Very clear explanation and well organized tutorial. Thank you so much for sharing. Keep up the great work!

@HierImNorden 2 years ago

This video is amazingly informative and helpful! I really appreciate the production value you put into this!

@big-bang-movies 1 year ago

Awesome content Adam. Especially the demos are pretty helpful. Please make more videos covering other use cases using ADF.

@raviv5109 4 years ago

As usual, simple & clear. I really like your videos Adam. The way you explain is so natural.

@AdamMarczakYT 4 years ago

I appreciate that!

@jatinderarora2261 3 years ago

One of the awesome tutorials on ADF and Azure Databricks. Thanks for sharing.

@AdamMarczakYT 3 years ago

You're very welcome!

@deepjyotimitra1340 3 years ago

This really helped me a lot. We had to deal with lots of Excel sheets with different formats. Thank you so much Adam for such a wonderful video. You are a star.

@AdamMarczakYT 3 years ago

My pleasure!

@ericjanssens3475 3 years ago

Hi Adam, as always this is a great presentation! Thanks for posting these videos!

@AdamMarczakYT 3 years ago

My pleasure!

@shahid646 4 years ago

Most demanding solution asked by business for long. Thanks for sharing :)

@AdamMarczakYT 4 years ago

My pleasure! thanks!

@manishdasgupta2997 3 years ago

This fits my business case. Thank you so much for this to the point tutorial!

@AdamMarczakYT 3 years ago

You're so welcome!

@mersihaceranic9640 4 years ago

Thank you Adam for all your videos and contribution. It helped me a lot.

@AdamMarczakYT 4 years ago

Glad to hear it! Thanks for tuning in :)

@balajibp7548 2 years ago

Your ADF playlist is AWESOME 🙂 and make videos on real time scenarios. Thank you...

@321zipzapzoom 1 year ago

Nice, and I was able to learn the concepts!! Thanks Adam

@pdsqsql1493 2 years ago

Very Excellent Video, nice step by step tutorial.

@jahnavimurthy 3 years ago

Thanks for all your videos. They have been very helpful!

@AdamMarczakYT 3 years ago

Glad you like them!

@ngophuthanh 3 years ago

Excellent video. Thanks, Adam.

@AdamMarczakYT 3 years ago

My pleasure!

@choudhary25 3 years ago

Thank you Adam for all your video.👍👍👍

@AdamMarczakYT 3 years ago

My pleasure! Thanks for watching :)

@gastondemundo9822 1 year ago

Awesome vídeo, thanks for sharing

@amjds1341 4 years ago

That's awesome. Thanks for posting

@AdamMarczakYT 4 years ago

My pleasure, thanks!

@carlosalonsocapilla4796 3 years ago

Overall, your videos are very good, but man... this video is really amazing! I really liked the way you explained everything from the introduction putting the current problem into context to the possible solutions. I hope you make more videos of this "real problems" style and how to solve them with the different tools that Azure provides us (and if it is related to data engineering better :p ) I congratulate you for the video, very very good.

@AdamMarczakYT 3 years ago

Thanks Carlos! I appreciate this more than you know. This is because I want to do a few more tutorials in 2021 for 'pure knowledge' where I just cover the service and its features, but later I want to do more and more real scenario implementations. :)

4 years ago

Awesome! Thanks for sharing!

@AdamMarczakYT 4 years ago

Thanks for watching! :)

@scsourav123 3 years ago

awesome tutorial Adam... Thanks for sharing..

@AdamMarczakYT 3 years ago

No problem 👍 my pleasure!

@ronsystems 3 years ago

Good job Adam.

@AdamMarczakYT 3 years ago

Thanks!

@RajanieshKaushikk 2 years ago

Very nice video 👍

@sharathkarthik7347 3 years ago

Quality content. Thanks

@AdamMarczakYT 3 years ago

Glad you think so!

@aniketsamant455 4 years ago

As usual nice video

@AdamMarczakYT 4 years ago

Thanks again!

@JuanGarcia-qy9dt 2 years ago

ufff! Awesome video, thanks a lot

@AdamMarczakYT 2 years ago

My pleasure!

@frclasso 3 years ago

Amazing!!!

@AdamMarczakYT 3 years ago

Thank you! Cheers!

@balanm8570 4 years ago

As always another awesome video. Thanks a lot for this video... Wondering how you were able to demo most of the azure services with pretty cool clarity and to the point !!!

@AdamMarczakYT 4 years ago

It's a gift! Thanks 😊

@shahid646 4 years ago

@@AdamMarczakYT I endorse Balan Comment. :)

@AdamMarczakYT 4 years ago

@@shahid646 Thanks a bunch :)

@sebastiencuber7088 4 years ago

awesome - thanks

@AdamMarczakYT 4 years ago

You're welcome!

@BijouBakson 2 years ago

Thank you

@abhishek8311 2 years ago

Hi Adam, I hope you're still monitoring this. First of all, superb video and has helped me in meeting some of my business requirements. One thing which I would like to understand is how can we load the worksheet name(eg: Cars, Planes etc) in a separate Excel or CSV file as record of data. Waiting for your response. Thanks

@prakashloganathan5726 3 years ago

Adam, your content is outstanding. If you get a chance, could you please post a video on how to get lineage (the likes of Informatica catalog, etc.) from the Azure Data Factory pipeline?

@AdamMarczakYT 3 years ago

Thanks, noted, maybe in the future :)

@RajivGuptaEverydayLearning 3 years ago

Nice video

@AdamMarczakYT 3 years ago

Thanks

@NILSUNIQ 4 years ago

Great video. A couple of queries though: 1. How to get all records for selected columns only using crealytics excel, say A:D? 2. How to skip some rows in crealytics excel (say skip the first 4 rows but keep headers), as the pandas read_excel parameters allow?

@AdamMarczakYT 4 years ago

Well, unfortunately the spark-excel library is not as flexible and well rounded as pandas. For 1, just use the example I've shown in the video using ranges. For 2, check github.com/crealytics/spark-excel/issues/65; not sure if they implemented it, but it looks like it should be there.
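
Editor's note: for readers who want to build such ranges programmatically, here is a minimal sketch. It assumes the `'Sheet'!A5:D1000` form that the spark-excel README documents for its `dataAddress` option; the helper name `data_address` and the cell bounds are hypothetical and not taken from the video. A range that starts at row 5 is one way to skip the first four rows, assuming the header option treats the first row of the range as the header.

```python
def data_address(sheet: str, first_cell: str, last_cell: str) -> str:
    """Build a range string such as 'Cars'!A5:D1000 (hypothetical helper)."""
    # Wrapping the sheet name in single quotes also covers names with spaces.
    return f"'{sheet}'!{first_cell}:{last_cell}"


# Columns A to D only, starting at row 5 so the first four rows are left out.
address = data_address("Cars", "A5", "D1000")
print(address)  # 'Cars'!A5:D1000
```

The resulting string would be passed as the value of the `dataAddress` option when reading the workbook.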

@mohamedriyazdeen6563 2 years ago

Great tutorial Adam. Spark-Excel installed on an interactive cluster and used in the development environment is working fine. When moving up to higher environments, linked services are created with job clusters. How does the Spark-Excel library get installed on job clusters?

@salmanriaz5184 1 year ago

Hi Adam, could you please make a video on ADF batch service? Your videos have been very helpful in understanding ADF. Thanks

@BijouBakson 2 years ago

At 10:30 you are selecting a table in the sheet. Is there an option for selecting more than just one table, i.e. creating additional table datasets to reflect the number of tables in the sheet, without resorting to, say, Databricks? Thank you

@cosimocuriale8871 3 years ago

Great video Adam, very simple and clear. However, is there a method (a library like crealytics) that allows saving a CSV file without it being partitioned? Thanks a lot!

@AdamMarczakYT 3 years ago

You can use the 'coalesce' or 'repartition' functions and specify 1 partition. This will end up with 1 partition file that's called something along the lines of part0000.csv, which you can use later on. You can also then use Scala to rename that file.
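
Editor's note: the rename step can also be done from Python. The following is a minimal sketch, assuming the DataFrame was already written with something like `df.coalesce(1).write.csv(out_dir, header=True)` and that the output folder is reachable as a local path (on Databricks this is typically a `/dbfs/...` path). The function name `promote_single_part` and all paths are hypothetical; the demo fakes a Spark output folder so it runs without a cluster.

```python
import shutil
import tempfile
from pathlib import Path


def promote_single_part(out_dir: str, target_file: str) -> Path:
    """Move the only part-*.csv file in out_dir to target_file and return the new path."""
    parts = sorted(Path(out_dir).glob("part-*.csv"))
    if len(parts) != 1:
        raise ValueError(f"expected exactly one part file, found {len(parts)}")
    target = Path(target_file)
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.move(str(parts[0]), str(target))
    return target


# Simulate a Spark output folder with one part file and a marker file.
demo_dir = Path(tempfile.mkdtemp())
spark_out = demo_dir / "cars_csv"
spark_out.mkdir()
(spark_out / "part-00000-abc.csv").write_text("make,model\nFiat,500\n")
(spark_out / "_SUCCESS").write_text("")

result = promote_single_part(str(spark_out), str(demo_dir / "cars.csv"))
print(result.name)  # cars.csv
```

Raising an error when more than one part file is present is a deliberate choice here: it catches the case where the write was not reduced to a single partition.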

@solanavargas1284 1 year ago

Great video Adam! So, isn't it possible to use files with xlsb extension?

@sudarshant2340 1 year ago

Hi, your video is awesome. I have a question: how to schedule each sheet at some time? Can you please post a video regarding the same?

@chrisretsin7068 3 years ago

Very nice tutorial, would you consider these activities as IT only or do you consider databricks as something the business could setup? The business is using currently R only locally, but would like to take advantage of the azure (spark) environment. Any considerations or advice on our journey? Thx

@AdamMarczakYT 3 years ago

I'd say platform setup should always be done by the internal IT team or IT vendor. But then you can grant them access and teach them how to use it :)

@shubhammahajan9117 2 years ago

Just a small question. If I make changes to underlying excel data, will this pipeline work? I want to connect my Excel file to the Azure SQL database and I am using this video for reference. I want to have an updated Azure SQL database whenever there is a change in connected Excel data.

@nikhilnikam5077 1 year ago

Hi Adam, thanks for the content. Is there a way to automate this and create a job/task to add Excel data to an Azure database? Thank you in advance.

@srinivasdevarampati6375 4 years ago

Great video, thanks Adam. While reading the list of sheets I'm getting an error: value sparkcontext is not a member of org.apache.spark.sql.sparksession spark.sparkcontext.hadoopconfiguration Thanks.

@AdamMarczakYT 4 years ago

After installing the library, remember to import it and detach & reattach the notebook.

@christofherdelgado177 1 year ago

Hi man, this video helped me a lot! Hey, is there any workaround or alternative for keeping a CSV or Excel file updated in the Azure container? Imagine a pipeline -> Source=excel -> Sink=SQL Database, and that Excel file has to be updated each day with new info.

@piesogrodnika572 1 year ago

Adaś, please tell me what one has to do to get pillowcases like those :) P.S. Great work, especially the whole series of videos about ADF

@Ulfhedan 6 months ago

How did you create the demo container to load the files? Was this in a previous video?

@prashantpatil1260 1 year ago

The supplied spreadsheet seems to be Excel 5.0/7.0 (BIFF5) format. POI only supports BIFF8 format (from Excel versions 97/2000/XP/2003). How do you handle it? It failed while creating a connection to Data Lake with Excel 5.0.

@vamsikrishnakilambi 3 years ago

Hi Adam, is there a way to write all the data from a dataframe? I have millions of records, and while writing in .xlsx format it only writes the maximum number of rows that one Excel sheet can handle. Shouldn't it split and write all the rows, like it does for CSV?

@AdamMarczakYT 3 years ago

You need to write this logic yourself. You can also try Pandas with Python; maybe it has more options too.
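
Editor's note: as an illustration of what "writing this logic yourself" could look like, here is a minimal sketch of the splitting step only. It relies on the .xlsx limit of 1,048,576 rows per worksheet and reserves one row for the header; the function name `sheet_chunks` is hypothetical, and writing each chunk to its own sheet or file is left to whichever writer is used.

```python
from itertools import islice
from typing import Iterable, Iterator, List

EXCEL_MAX_ROWS = 1_048_576  # row limit of a single .xlsx worksheet


def sheet_chunks(rows: Iterable, rows_per_sheet: int = EXCEL_MAX_ROWS - 1) -> Iterator[List]:
    """Yield lists of rows, each small enough for one sheet (one row kept for the header)."""
    iterator = iter(rows)
    while True:
        chunk = list(islice(iterator, rows_per_sheet))
        if not chunk:
            return
        yield chunk


# Small demo: 10 rows with at most 4 per sheet.
sizes = [len(chunk) for chunk in sheet_chunks(range(10), rows_per_sheet=4)]
print(sizes)  # [4, 4, 2]
```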

@ChallusMercer 4 years ago

Thank you for your effort on covering this topic Adam! I have a question - what if i have a customer database running on premise on his machine. Does microsoft offer a tool for exporting data from the database and uploading this data for example to a data lake or what ever location in the cloud for processing this data with data factory and so on? What are the common steps in this case?

@jgowrri 4 years ago

Install data gateway to extract on premise data and data factory to load into data lake .. hope this helps

@AdamMarczakYT 4 years ago

@@jgowrri is absolutely correct. Except for clarity, data gateway for data factory is called Self-hosted Integration Runtime, not to be mistaken with other Azure service called on-premises data gateway which is used with other services. That said, Integration Runtime with ADF should be used If we are targeting coterminous syncing scenario, i.e. co-existence of both databases for certain period of time. If you are migrating to the cloud as a one time process then maybe you should look at Azure Database Migration Service instead :) Hope this helps. If you want to check integration runtime I already have a video on that. Good luck :)

@AdamMarczakYT 4 years ago

@@jgowrri ps. One year ago when I started I wished to grow community to the point where members will help each other. You made my day mate :)

@alexfridi8663 4 years ago

great explanation! Thank you! Is it possible to use excel native formulas to change the content with Databricks?

@AdamMarczakYT 4 years ago

Not really, only Excel understands and executes Excel formulas dynamically; for other tools like Databricks or Data Factory, it's just text with a value.

@alexfridi8663 4 years ago

@@AdamMarczakYT OK, theoretically I can read the values, calculate, and write back as you showed in the video. No idea how to use a random function with Databricks. The requirement is to generate random values and write them back into the same Excel cells.

@terryliu3635 3 years ago

Thanks Adam. Do you know if there is a way to detect if the Excel has been updated on the SharePoint and trigger the ADF pipeline? Currently we’re using Logic App but not sure if we could avoid using it? Thanks.

@AdamMarczakYT 3 years ago

Check out my Azure Data Factory Triggers tutorial; it shows how to trigger ADF with a Logic App, and Logic Apps are amazing for triggering and moving files from SharePoint.

@terryliu3635 3 years ago

@@AdamMarczakYT thanks Adam, much appreciated!

@AI-Health-posts 4 years ago

Great video Adam, thanks. Is there a way to connect Excel files on SharePoint Online to Data Factory? Thanks

@AdamMarczakYT 4 years ago

You probably could move them with either Data Factory or Logic Apps to blob first, process them, and then transfer them back. This would be the safest approach. Another approach involves using the Logic Apps Excel connector for SharePoint for editing, but I discussed my concerns about it in the video.

@SuperJamu 2 years ago

Is there a way to copy multiple sheets in Data Factory? In Databricks I can see how to do it: a for or while loop around .option("dataAddress", "myVarHere!") can do it. But how to achieve this in Data Factory? With parameters?

@kanishkkashyap4662 2 years ago

Hi Adam, Could you please help me to make some column as read-only while writing to excel format using Crealytics spark-excel library

@giancarlosql2005 3 years ago

Great Video Adam! One question, do you know a way to read XLSB files in Pyspark? Unfortunately in Pandas it seems it requires a local path and my datalake path is not working :( Do you know a way to read XLSB files in databricks or data factory? Appreciate any feedback you can provide, Thanks!

@AdamMarczakYT 3 years ago

I'm pretty sure I've tested pandas on databricks with datalake path previously and it worked.

@sureshpallapolua8 4 years ago

Awesome explanation. Thank you Adam. Can you please explain how we can dynamically load multiple Excel workbooks, each workbook having multiple sheets? If possible, please provide source code on GitHub. Thank you!

@AdamMarczakYT 4 years ago

Thanks. Dynamic multiple sheet demo was shown in the video so just watch it until the end. But I can't provide you with source code as I don't have any samples other than the one attached in the video description.

@amitgulhane8519 9 months ago

Can we use this same functionality in Azure Synapse notebook?

@davidcardenas4266 3 months ago

Great tutorial! Is there a way to use this in PySpark? I tried but did not succeed.

@pawanreddie2162 3 years ago

How to load multiple xlsx files with same folder path at a time into databricks using pyspark?

@ravitiwari6335 4 years ago

Hi Adam, I was using Mapping Data Flow in ADF and am somehow facing challenges: I am looking for an aggregate function like collect, but it should collect only distinct elements, which is not possible as a collectdistinct expression function does not exist. Can you please suggest how I can implement it?

@AdamMarczakYT 4 years ago

Sounds to me like you just need standard aggregate action. Why would you need collect in this case?

@ravitiwari6335 4 years ago

@@AdamMarczakYT Thanks for your reply. Can you guide me on which aggregate function? Because collect brings all rows of column2 associated with the unique value of column1, which is placed in the group by. Collect is the expression function inside the aggregate transformation, but I need a function that does collect distinct.

@canadatorontovideos7283 4 years ago

Hi Adam, is there any suggestion for testing/validating data processed in Azure Data Lake?

@AdamMarczakYT 4 years ago

There isn't any service that does this out of the box. So just like in good old days you need to write this by yourself. I tend to do this in databricks as notebooks.

@canadatorontovideos7283 4 years ago

@@AdamMarczakYT thanks Adam

@SpeedyMechnic 4 years ago

I've got the need to run a SQL report that produces a few tables; one of the tables has around 300 million rows, and I then need to do a SUM() on one of the columns. Should I be using Databricks? What can do this? I think writing out to a CSV would be inefficient.

@AdamMarczakYT 4 years ago

It depends where the data is. But I don't understand how this is related to a video about excel processing.

@carlosjdesouza000 4 years ago

Hi Adam, Could you make a video explaining how to copy data from mysql table to delta lake storage with data factory? best regards my friend.

@AdamMarczakYT 4 years ago

You can use mapping data flows to export to delta lake. docs.microsoft.com/en-us/azure/data-factory/format-delta Unless you mean data lake, which is different from delta lake.

@Cristian-ek7xy 3 years ago

Thanks for the video. How do you automatically test this?

@AdamMarczakYT 3 years ago

Excellent question without any good answer, I'm afraid. I haven't found any good tool/pattern for testing Azure Databricks notebooks :( I typically just write small notebooks to test other notebooks (similarly to how you write unit tests), but that's about it.

@uday20101 2 years ago

Can I compile Tables in one excel and automate it to do this on a daily basis

@dev09able 3 years ago

Adam, is it possible to load data to on prem db using ADF?

@AdamMarczakYT 3 years ago

Yes as long as your Self-Hosted Integration Runtime is installed in a local network (or extended network with Azure).

@jaimis3639 3 years ago

Is it possible to skip rows in Azure Data Factory when reading Excel files, similar to what you showed in Databricks? Typically business reports have informational headers that are not part of the data.

@AdamMarczakYT 3 years ago

You can use a range to specify the starting row, e.g. A100:X1000

@mominmushtaq6444 3 years ago

Hey Adam, this was an awesome video !!!. Keep posting videos like this ... I have my project requirement, where I need to get MySQL Database Data in an on-going basis. We have two scenarios while getting the Data from MySQL. 1. First time Copy - Where we will get all the MySQL into Azure Synapse . For this we planned to use ADF to first store data in ADL gen2 and use polybase to store data into Azure synapse. 2. Incremental extract - Where we need to get updated data near real time for which data has updated in MYSQL . Do you have any suggestions for implementing the above 2nd scenario in near real time?. Thanks for your support.

@AdamMarczakYT 3 years ago

Near real time scenarios typically require some tool that can perform real time replication based on transactional logs. But I don't know MySQL nor tools like that, so I can't help here. If near real time means ~10 min, maybe simple queries and jobs every 10 minutes are enough with some metadata-driven approach. Thanks for stopping by.

@mominmushtaq6444 3 years ago

@@AdamMarczakYT ~10 to ~5 minutes will be fine too. Can you suggest how to sync data from MySQL to Synapse with Data Factory?

@MrAconfee 3 years ago

Hello! Does this library have other dependencies? I'm doing the simplest case possible, your first example, but getting an error when I try to do anything with the dataframe: "Could not initialize class org.apache.spark.rdd.RDDOperationScope". Any clue what's going on here? It seems like a bug with the library.

@AdamMarczakYT 3 years ago

All requirements are listed in the video. Just check if your cluster's spark version matches library version. You can also check the details on their website.

@VijayGupta-ni2hm 4 years ago

Hi Adam can we do incremental from Data flow ...

@AdamMarczakYT 4 years ago

Incremental load is really a technology-agnostic topic, because it's about figuring out technical + data level information; as such there are quite a few options to do it. Check out this doc for some examples docs.microsoft.com/en-us/azure/data-factory/tutorial-incremental-copy-overview and once you figure out the way you want to go, data flows should be easier to set up.
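
Editor's note: one common pattern behind incremental loads is a high-watermark column, where each run copies only rows changed since the last stored watermark and then advances it. The sketch below illustrates the idea with in-memory SQLite standing in for the real source and control table; the table names, column names and the `load_increment` function are all hypothetical, and this is not the ADF implementation itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE source_orders (id INTEGER, modified_at TEXT);
    CREATE TABLE watermark (table_name TEXT PRIMARY KEY, last_value TEXT);
    INSERT INTO source_orders VALUES (1, '2021-01-01T10:00:00'), (2, '2021-01-02T10:00:00');
    INSERT INTO watermark VALUES ('source_orders', '1900-01-01T00:00:00');
    """
)


def load_increment(conn: sqlite3.Connection) -> list:
    """Return rows changed since the stored watermark, then advance the watermark."""
    (old_mark,) = conn.execute(
        "SELECT last_value FROM watermark WHERE table_name = 'source_orders'"
    ).fetchone()
    rows = conn.execute(
        "SELECT id, modified_at FROM source_orders WHERE modified_at > ? ORDER BY modified_at",
        (old_mark,),
    ).fetchall()
    if rows:
        # The newest modified_at seen in this run becomes the next watermark.
        conn.execute(
            "UPDATE watermark SET last_value = ? WHERE table_name = 'source_orders'",
            (rows[-1][1],),
        )
    return rows


first_run = load_increment(conn)   # picks up both existing rows
conn.execute("INSERT INTO source_orders VALUES (3, '2021-01-03T10:00:00')")
second_run = load_increment(conn)  # picks up only the newly added row
print(len(first_run), len(second_run))  # 2 1
```

Timestamps are stored as ISO 8601 strings here so that plain string comparison matches chronological order.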

@joyyoung3288 3 years ago

Thanks, can it be implemented on AWS Databricks? Seems not?

@AdamMarczakYT 3 years ago

It should be possible. Databricks is a multi-cloud platform and most features are available when it comes to data movement and transformations.

@Charango123quena 3 years ago

how would you pass the file name as a parameter? for eg we get filenames with the format .. data_20200511.xls where the date component changes in the file name

@AdamMarczakYT 3 years ago

Check out my ADF parametrization tutorial ru-vid.compISBgwrdxPM and use that to pass a parameter to Databricks; in Databricks, use widgets to get the parameter value.
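
Editor's note: once the run date reaches the notebook, the file name still has to be assembled. Here is a minimal sketch of that step using the `data_20200511.xls` pattern from the question. The function name `dated_file_name` and its defaults are hypothetical, and a plain `date` object stands in for the value that would arrive as a notebook parameter.

```python
from datetime import date


def dated_file_name(run_date: date, prefix: str = "data_", extension: str = ".xls") -> str:
    """Build a name such as data_20200511.xls from a run date."""
    return f"{prefix}{run_date:%Y%m%d}{extension}"


file_name = dated_file_name(date(2020, 5, 11))
print(file_name)  # data_20200511.xls
```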

@e-zuan2687 2 years ago

I have a problem in Data Factory: it says no GitHub. How can I get around it?

@JackPickle 3 years ago

It might be me Adam, but starting this demo now as a total Databrick newbie, some of the commands don't work with the current runtimes available (6.4, 6.6, 7.0, 7.1, 7.2, 7.3 and 7.4) for Crealytics. NB 6.5 does not exist for me. Trying both the supplied scala excel libraries depending on Scala version yields varying results. For example, using 7.4 and 2.12:0.13., no commands run in the workspace. Using 6.6 and 2.11:0.13., most do until I get to the worksheet looper. If it's something I've done wrong, then apologies, but if my assumption is correct - does the syntax for the libraries change so much between runtime versions?

@AdamMarczakYT 3 years ago

There's no difference in the language, but on Spark 3.0 the library probably had some issues. I probably would just install the latest package, like com.crealytics:spark-excel_2.12:0.13.5 (always check Maven for the latest releases). I tested the code on the 7.3 runtime with this package and it ran with no problems. I ran the entire script on 7.3 but also on 5.5 with no issues at all.

@JackPickle 3 years ago

@@AdamMarczakYT many thanks Adam, I’ll give it whirl first thing. Next thing on my list is to parametrise things like keys and then export the file to an azure sql db. Great video though and really informative

@AdamMarczakYT 3 years ago

Make sure to check my tutorial on Databricks Secret Scopes ;) best of luck!

@AxL28AxL 3 years ago

Is it possible to use Azure Data factory to sink data to an Excel file?

@AdamMarczakYT 3 years ago

Not at this time :( Maybe in the future MS will add this support docs.microsoft.com/en-us/azure/data-factory/format-excel?WT.mc_id=AZ-MVP-5003556

@bideveloper357 3 years ago

Adam, can you make a series of Databricks tutorials?

@AdamMarczakYT 3 years ago

Maybe in the future, yea, it's a cool idea :)

@bideveloper357 3 years ago

@@AdamMarczakYT Simple basic Azure activities in ADF we can understand using Microsoft docs. Please build some complex pipelines or real-time project pipelines. Also please include the limitations of activities and workarounds for them, like the Lookup activity working for only 5k rows.

@balaramtupili 3 years ago

Hi Adam, very informative video. I'm facing an issue when printing data even if I defined Custom Schema. RuntimeException: Error while encoding: java.lang.RuntimeException: scala.Some is not a valid external type for schema of string Caused by: RuntimeException: scala.Some is not a valid external type for schema of string

@balaramtupili 3 years ago

It resolved by using a new version of the library.

@AdamMarczakYT 3 years ago

Cool! Always keep your libraries up to date :)

@mokshithvsharma764 3 years ago

@@balaramtupili Which library did you use? Could you mention it here?

@shivanidubey1616 4 years ago

Sir, very helpful. But what if I want to load multiple Excel files, each having multiple sheets? How do we load them?

@AdamMarczakYT 4 years ago

Multiple sheet scenario is shown in the video. Multiple files is easy but not in the video. There are plenty of examples on the web/blogs/forums so you can try checking them out.

@vidyasalimath6177 4 years ago

I was using the Excel format in Data Flow as a source and faced issues with data preview and with selecting a sheet name containing a space. Kindly let me know if these are supported now.

@AdamMarczakYT 4 years ago

The bug persists but it's very easy to work around it. As the error message suggests, click on edit to put name of the sheet manually and use single quotes around it. Example: 'My Sheet'. As a result preview data button on the dataset will stop working but data flows preview and flow itself will work just fine.

@vidyasalimath6177 3 years ago

@@AdamMarczakYT Hi Adam, I have a query: can we refer to an Excel file via a wildcard path in mapping data flow? If we have a filename + date.xlsx and the date is dynamic, can we still refer to this sheet with tabs?

@shyamthakur9799 3 years ago

Great video but you have not shown with xlsb file format..!

@gobigpoker 3 years ago

@22:05, I keep getting this error: RuntimeException: scala.Some is not a valid external type for schema of string. What do you think might be causing the issue?

@AdamMarczakYT 3 years ago

Unfortunately not from top of my head, sorry. :( My guess is you defined schema for table and mismatched it with the file contents.

@99vi88 3 years ago

I solved this problem using a cluster with 6.4 Runtime and com.crealytics:spark-excel_2.11:0.13.6 library.

@SH-qt4ro 3 years ago

@@99vi88 Cool , Thanks Vinicius Pivetta. I tried multiple times with different option - but was getting similar errors. 6.4 Runtime did the trick (6.4 Runtime and com.crealytics:spark-excel_2.11:0.13.6)
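
Editor's note: the fix reported above comes down to matching the Scala suffix in the Maven coordinate (`_2.11` or `_2.12`) with the Scala version of the cluster runtime. Below is a minimal sketch of that check. The function names are hypothetical, and the cluster's Scala version is passed in by hand (it is normally listed next to the runtime version when creating a cluster).

```python
def scala_version_of(coordinate: str) -> str:
    """Extract the Scala binary version from a coordinate like group:artifact_2.12:version."""
    group, artifact, version = coordinate.split(":")
    name, _, scala_version = artifact.rpartition("_")
    if not name or not scala_version:
        raise ValueError(f"no Scala suffix in artifact '{artifact}'")
    return scala_version


def is_compatible(coordinate: str, cluster_scala_version: str) -> bool:
    """True when the library was built for the cluster's Scala binary version."""
    return scala_version_of(coordinate) == cluster_scala_version


print(is_compatible("com.crealytics:spark-excel_2.11:0.13.6", "2.11"))  # True
print(is_compatible("com.crealytics:spark-excel_2.12:0.13.5", "2.11"))  # False
```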

@rohitkulkarni9038 3 years ago

Is it possible to use PowerShell to copy a source table from SQL Server to one of the containers in CSV format? Please let me know of any video related to this. Thanks, RK

@AdamMarczakYT 3 years ago

You can but you need to write the script yourself. There is no out of the box ready script for you to use. Unfortunately I don't have a video covering this topic.

@rohitkulkarni9038 3 years ago

@@AdamMarczakYT: Please let me know the link if you have for Custom Activity please share it

@joyyoung3288 3 years ago

Installing spark-excel seems to be OK, but I get the error message: NoClassDefFoundError: Could not initialize class com.crealytics.spark.excel.WorkbookReader$ at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:28). Can anyone help?

@AdamMarczakYT 3 years ago

What's your cluster configuration?

@zamarinen 1 year ago

to master databricks is my goal, but damn seems to be a long way there...

@ncbshiva 3 years ago

Hi Adam, Thanks for the Videos, I am following you for all Azure related I created Databricks as shown in your video, but i am facing below error. I have installed both Scala version 2.11 and 2.12. "java.lang.NoClassDefFoundError: Could not initialize class com.crealytics.spark.excel.WorkbookReader$" Could you help me ?

@AdamMarczakYT 3 years ago

Hard to say which step you missed. Did you import the library as per the video? Try detaching and attaching the notebook too.

@ncbshiva 3 years ago

@@AdamMarczakYT Yes, I imported both the libraries that you mentioned.

@AdamMarczakYT 3 years ago

I'd try redoing the steps from the beginning. Maybe you missed some step. Try restarting the cluster too.

@SuperJamu 2 years ago

And how to read an xlsb file?

@rishabhchaurasia311 3 years ago

error : NoClassDefFoundError: Could not initialize class com.crealytics.spark.excel.WorkbookReader$ using com.crealytics:spark-excel_2.12:0.13.1 for scala 2.12

@AdamMarczakYT 3 years ago

Hard to say, you must have done something differently :( try doing the demo again.

@AA-kq8on 1 year ago

can we use Python in Databricks????

@ThoughtDiffusion 3 years ago

Hi, I have one question, or I would like you to prepare a video on the scenario I am describing here. Let's say you have a bunch of documents in folders or a hierarchy of folders. You have one Excel file which contains the metadata of all documents within the folders. The Excel sheet has the document title, document type, document created date, and the path of the folder where the document is stored. So basically the Excel sheet stores the referential integrity of the documents and their metadata. How would we upload each document from this source directory to Azure Blob storage as blobs? Each blob should also have metadata added, and each blob should be stored in a particular folder in blob storage; the folder path is given in the Excel sheet for reference. How would we do this using an Azure Data Factory pipeline flow?

@AdamMarczakYT 3 years ago

I'd write a Databricks notebook for this. This logic is too complex to do in ADF.

@ThoughtDiffusion 3 years ago

@@AdamMarczakYT Thanks for responding, I really appreciate you and all your videos, which are very helpful. Could you please suggest an easy way to move/copy a set of documents to Azure Blob storage with some metadata information? Let's say we have a set of documents on a local machine or OneDrive, and another Excel file which has document references and metadata information (a few more columns). How would we migrate it to Azure Blob with the documents and their metadata? Would it be MS Flow? Would it be ADF? Would it be Logic Apps? Would it be any other way you can think of, and how would it work?

@sid0000009 3 years ago

Hello Adam, how can we archive an Excel file, as Excel is not supported as a sink? Any tips? Thank you (in reference to Azure Data Factory)

@AdamMarczakYT 3 years ago

I probably would use Databricks with Spark-Excel using Scala or better yet Pandas using Python.

@sid0000009 3 years ago

@@AdamMarczakYT we have existing pipelines in ADF and just want to plug in the archiving part

@AdamMarczakYT 3 years ago

Well, ADF can't do it. You need to employ extra tool. In my opinion use ADF to output CSV's and then call databricks to convert those CSV to Excel. Should be cheap since there is no logic just conversion.

@sid0000009 3 years ago

We found binary file format to be working for moving any file formats including excel..might be helpful to someone looking for similar use cases. Thanks..

@jagerzhang4059 3 years ago

@formatDateTime(trigger().startTime, 'yyyyMMdd') Adam how this work for output path, eg like output/2020/12/01 folder to save file

@jagerzhang4059 3 years ago

I would like to copy the data every day and distinguish the folders by date. How could I create a folder daily, like output/2020/12/01, output/2020/12/02, etc.?

@AdamMarczakYT 3 years ago

Use format in the second parameter like so @formatDateTime(trigger().startTime, 'yyyy/MM/dd') then use concat @concat('output/',formatDateTime(trigger().startTime, 'yyyy/MM/dd'))
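
Editor's note: to sanity-check what that expression is expected to produce, here is a minimal Python equivalent. This is an illustration only, not ADF syntax; the function name `output_folder` and the sample timestamp are hypothetical. The ADF pattern 'yyyy/MM/dd' corresponds to `%Y/%m/%d` in Python.

```python
from datetime import datetime


def output_folder(start_time: datetime, root: str = "output") -> str:
    """Python mirror of concat('output/', formatDateTime(startTime, 'yyyy/MM/dd'))."""
    return f"{root}/{start_time:%Y/%m/%d}"


folder = output_folder(datetime(2020, 12, 1, 6, 30))
print(folder)  # output/2020/12/01
```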

@ravipaul1657 3 years ago

When is the next episode coming 😫

@AdamMarczakYT 3 years ago

Episodes are coming out every week, sometimes two weeks, why?

@jeevannr5980 3 years ago

such a click bait! you did not mention any way to handle xlsm or xlsb, I just wanted that!

@AdamMarczakYT 3 years ago

Just because video doesn't have every possible detail explained it doesn't make it a clickbait. Also if you would watch it you would see that I did show how to process XLSM files and that XLSB is not supported and if you need XLSB then use pandas with python.

@sayanm7750 3 years ago

Jeevan NR - sad to see your disrespectful comment to Adam who is helping the community voluntarily...and hope you got a chance to notice the grace with which he replied to your complaint. Thanks!!