Hello Will! Thank you for your effort! I would like to understand how this is handled in the real world. When aiming for robustness and "self-healing," isn't it common to process all unprocessed files, rather than just the file from the current day? For example, what happens if there was an issue over the weekend or something similar? Regarding this kind of logic: Is it typical to move processed files to a different folder structure, or is it more common to keep track of which files were successfully processed by writing to a control file? Are there any other common mechanisms for this? If you have any references or examples related to these questions, I would greatly appreciate it. Thank you so much for your response!
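One common pattern for the "catch up on everything unprocessed" idea is exactly the control-file approach you describe. A minimal sketch, assuming a hypothetical landing folder and a JSON control file (the paths, file pattern, and `process_file` helper are placeholders, not from the video):

```python
import json
from pathlib import Path

# Hypothetical lakehouse paths; adjust to your own folder layout.
LANDING_DIR = Path("/lakehouse/default/Files/raw/weather")
CONTROL_FILE = Path("/lakehouse/default/Files/control/processed_files.json")

def load_processed() -> set:
    """Read the control file listing files that were already processed."""
    if CONTROL_FILE.exists():
        return set(json.loads(CONTROL_FILE.read_text()))
    return set()

def mark_processed(processed: set) -> None:
    """Persist the updated list of processed files back to the control file."""
    CONTROL_FILE.parent.mkdir(parents=True, exist_ok=True)
    CONTROL_FILE.write_text(json.dumps(sorted(processed)))

def process_file(path: Path) -> None:
    """Placeholder for the real transform / append-to-table logic."""
    print(f"processing {path.name}")

processed = load_processed()
backlog = [p for p in sorted(LANDING_DIR.glob("*.json")) if p.name not in processed]

for path in backlog:          # picks up weekend gaps: every unprocessed file is handled
    process_file(path)
    processed.add(path.name)  # only mark a file after it was processed successfully

mark_processed(processed)
```

Moving processed files to an "archive" folder achieves the same thing without a control file; the trade-off is that the control file keeps the landing folder immutable, while moving files makes the backlog visible at a glance.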
Hey Will! I appreciate your detailed walkthrough of the code. The practical examples on notebook usage, data pipelines, and scheduling were very insightful, mirroring what we'd do as data engineers. Thanks!
Great vid! I'd love to see you do this with a SharePoint source. I use a lot of Power Automate flows to get my data into lists in a semi-structured way; doing this in Data Factory and pushing it out to business users, as well as making it available as a Power BI source, would be my end goal.
Thank you for the explanation using a practical example! Wouldn't it be more efficient and easier to maintain to perform both steps in a single Dataflow Gen2, instead of generating the JSON files (pipeline step 1) and then reading them with the notebook and appending the data to a table (pipeline step 2)? In a Dataflow Gen2 the intermediate file handling would be unnecessary, the append functionality is also available there, and you have everything in M code in one place (maintainability). The dataflow's schedule can then be orchestrated by a pipeline as well.
Yes, in Fabric there are normally two or three different ways of doing something. In this video I wanted to show the notebook approach. It has the benefit that the raw JSON is saved as-is, plus the logic can be tested and validated (not really possible with a dataflow) 👍
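To illustrate the testability point: because the transformation lives in plain Python, it can be exercised with a tiny test before touching any real data. A minimal sketch, assuming a hypothetical `flatten_forecast` helper (not the exact function from the video):

```python
def flatten_forecast(record: dict) -> dict:
    """Hypothetical helper that flattens one raw JSON record into a table row."""
    return {
        "city": record["city"],
        "temp_c": record["main"]["temp"],
    }

# A quick inline check that can run in the notebook (or in a separate test cell)
sample = {"city": "London", "main": {"temp": 18.5}}
assert flatten_forecast(sample) == {"city": "London", "temp_c": 18.5}
```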
Hi Will! I've got a notebook set up to collect GTFS-RT (real-time bus location and trip data) from a protobuf feed within Fabric. I had this successfully running on a schedule every couple of hours, but realized I needed to start collecting it more frequently, every couple of minutes, to do the needed level of analysis. However, it looks like the time it takes to deallocate and reallocate a Spark session for the notebook is longer than the time between my scheduled runs. The solution might just be to have the data collection portion of the notebook run in a loop throughout the day, and then schedule the notebook to run just once a day, but I was wondering if you had any other ideas, or if you know of another method for getting protobuf data into a Fabric lakehouse without a notebook and the Spark session that comes with it? Thanks!
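The "loop inside one long-running notebook" idea could look roughly like the sketch below: one scheduled run per day keeps a single Spark session alive and polls the feed every few minutes, landing the raw protobuf bytes in the lakehouse Files area. The feed URL, output folder, and intervals are placeholders, and error handling is kept minimal:

```python
import os
import time
from datetime import datetime, timezone
import requests

# Hypothetical GTFS-RT endpoint and lakehouse output folder; swap in your own.
FEED_URL = "https://example.com/gtfs-rt/vehicle-positions.pb"
OUT_DIR = "/lakehouse/default/Files/raw/gtfs_rt"

POLL_SECONDS = 120        # collect every two minutes
RUN_FOR_HOURS = 24        # one notebook run covers the whole day

os.makedirs(OUT_DIR, exist_ok=True)
end = time.time() + RUN_FOR_HOURS * 3600

while time.time() < end:
    resp = requests.get(FEED_URL, timeout=30)
    resp.raise_for_status()
    # Save the raw protobuf bytes; parsing (e.g. with gtfs-realtime-bindings)
    # can happen in a separate downstream step.
    stamp = datetime.now(timezone.utc).strftime("%Y%m%d_%H%M%S")
    with open(f"{OUT_DIR}/vehicle_positions_{stamp}.pb", "wb") as f:
        f.write(resp.content)
    time.sleep(POLL_SECONDS)
```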
Hello Will, I would like to connect to Microsoft Fabric using a copy activity to copy my collection, but I'm encountering this error. I likely have an issue with permissions, I suppose? Or perhaps I need to set up a private endpoint? I'm not sure. Thank you for your assistance!
Can we pass the created folder as a dynamic parameter to the notebook? e.g. if the first copy data step in the pipeline ingested the data into an ADLS folder or unmanaged file path of 2023/08/17, rather than recalculating the date folder structure in the notebook function, can we pass the folder created during ingest to the notebook activity?
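Yes, the pipeline's Notebook activity can pass values into the notebook via base parameters, and the notebook receives them through a parameter cell. A minimal sketch of the notebook side, assuming a hypothetical parameter named `ingest_folder` (and that a default lakehouse is attached so `spark` and relative `Files/...` paths are available):

```python
# Parameter cell (mark this cell as a parameter cell in the notebook UI).
# The Notebook activity in the pipeline can override this default via base parameters,
# e.g. with a dynamic expression such as @formatDateTime(utcnow(), 'yyyy/MM/dd'),
# or by passing through the same variable the copy activity used for the sink folder.
ingest_folder = "2023/08/17"  # hypothetical default for interactive runs

# Later cells just build paths from the injected value instead of recomputing the date.
raw_path = f"Files/raw/weather/{ingest_folder}"
df = spark.read.json(raw_path)
display(df)
```

Passing the same expression (or variable) to both the copy activity and the notebook keeps the folder logic defined in exactly one place in the pipeline.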