Common mistakes in big data models 

SQLBI
104K subscribers · 64K views

What are the common mistakes in big data models for Power BI and Analysis Services?
In this live webcast, Adam Saxton, Patrick LeBlanc, Alberto Ferrari, and Marco Russo share their multi-year experience with many different customers and scenarios.
During the live event, you can post comments and ask questions in the live chat window.

Category: Entertainment

Published: 15 Apr 2021

Comments: 63
@Nelmistro 2 years ago
To see all 4 of these people in one video is kinda like seeing the Avengers of Big Data. I've definitely learned a lot from these guys!
@Don_Modern_Ancestor 3 years ago
I've taken courses and consumed SQLbi and Guy in a Cube videos which are really good, but as a group you guys were great! We're starting strategy meetings for data governance, administration and modeling for a large enterprise system and appreciate your insights.
@stumit63_tech 2 years ago
Hear, hear!
@Furryfattal 1 year ago
Great talk, gentlemen. What kinda put the bow on the whole thing for me was the impromptu demo of how "follow up" meetings are formed at the very end of the segment. That really hit home, so thank you again!
@Pooja-pd1ee 3 years ago
Loved the entire session. There's so much you are doing for the BI community. Keep up the good work, guys.
@clausm73 3 years ago
This is one of the best Power BI videos ever! Why? Because a lot of concepts are touched on, and they should lead the viewer to investigate further.
@kostaskolotouros4345 2 years ago
The Big Four of Power BI in the same panel! Thank you guys!
@bbbradl 2 years ago
Thanks! I hope you have more content on very-large-dataset issues & solutions! My take-away is "don't waste" and "import whenever possible". In our case, the modeling team is using DirectQuery to views that sometimes have 50+ joins, that are non-scalar (they use ranges), and that are themselves joined to others in the model. I'm just the front-end guy, but I've offered them the following suggestions: 1. Make all keys integer. 2. Use in-memory materialized views for the most complex queries. 3. On SSAS, bring the most common tables into memory. 4. Move the values most used as filters and parameters out of the snowflakes (like Employee Status Code, Unit Code) and into their parent dimension tables, because they're always referenced and relationships are costly. However, the restrictions on a live connection to an SSAS model (using the Analysis Services connector) are many. I struggle daily with how to get by without being able to create columns or tables, how to join other datasets or models when needed, or, in case I need to bypass their model and go straight to the OLTP, how to preserve row-level security. A series would be A W E S O M E!!!
@jkschola1 10 months ago
I've been tuning into both @SQLBI and Guy in a Cube's videos to enhance my Power BI Best Practices knowledge - what an epic crossover! 👏 Kudos to both of you for your valuable contributions! 👏
@shujashakir9952 1 year ago
All masters together. Great!
@wexwexexort 2 years ago
Fantastic session, thanks!
@mikar636 3 years ago
Amazing!!! Thanks for sharing!!
@reynaldomalave 3 years ago
Thank you guys!
@sushilgulati 1 year ago
Thanks for this great conversation, guys!
@ThanhNhandl 3 years ago
I love the way they discussed most of the topics.
@anhphan9141 3 years ago
OMG my favorite Power BI People in 1 video. Love it!
@micahdail 3 years ago
Yes, 4 billion. ;-) 14:18. I do want to add, to Patrick's point, that it is a balance between what's in your model and what you need. 4 billion worked very well regardless, but after internal discussions we decided to leave the less-frequently-used grains in DirectQuery and free up some P2 space for other models. So we are back down to a slim 1.5B on that one model, which is enabling us to deploy more models, too.
@rrestituti 2 years ago
You are freaking intelligent!!!!! I love your channels.
@happyheart9431 8 months ago
A billion thanks for sharing such valuable knowledge.
@joannageorge6628 2 years ago
Wait a minute, so you have ALL my favorite people in 1 video? I can't handle this :)
@artigasmoises 3 years ago
Excellent video, greetings.
@AnandDwivedi 3 years ago
Amazing discussion.
@bojanjovanovic756 3 years ago
Thank you very much! Your contribution is priceless!
@jjjj8058 2 years ago
Great discussion. My issue with import over DirectQuery is that you lose the centrally managed RLS applied at the source.
@Ajuneja99 2 years ago
Thanks for the great content. You guys talked about DirectQuery being a last resort, for scenarios like needing the latest data or a client not willing to store data in Power BI. What I have seen with a client is database-side RLS being one of the reasons for not using import mode. Any thoughts around that, or any other content I can go through related to this? I'm now wondering whether there is any workaround for that.
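[Editor's note: a common workaround for this scenario, sketched below and not taken from the video, is to replicate the source-side security rules inside the imported model with dynamic row-level security. The table and column names (Security, Region, UserEmail, RegionKey) are hypothetical; the pattern is a role whose DAX filter matches the signed-in user against a security mapping table imported from the same database.]

    -- DAX filter expression for a role defined on the Region dimension:
    -- keep only the regions mapped to the current user in the Security table
    Region[RegionKey]
        IN CALCULATETABLE (
            VALUES ( Security[RegionKey] ),
            Security[UserEmail] = USERPRINCIPALNAME()
        )

An equivalent variant puts the filter Security[UserEmail] = USERPRINCIPALNAME() on the Security table itself and lets a relationship with "apply security filter in both directions" propagate it to the dimension. Either way, the mapping table refreshes together with the model, so the rules can stay centrally managed at the source.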
@FedericoLov 1 year ago
Patrick! Even if the load happens once a day, stakeholders don't want to introduce any additional lag. The problem is that, yes, it runs once a day, but we don't know at exactly what time, so setting up a refresh schedule becomes difficult. Also, partition management in Power BI is still behind what we used to have in SSAS, and incremental refresh is only available on Premium.
@hmhkh 3 years ago
Hi Marco, Alberto, Adam and Patrick. I am a fan and follower of all four of you; thanks for the informative discussion. We are a telecom operator and we have a tremendous amount of data from all our nodes, so we may need a consultation from your side to optimize our data models. Our backend is SSAS Tabular and our front end is Tableau.
@SQLBI 3 years ago
You can contact us on www.sqlbi.com/consulting/
@tarekdemiati9991 1 year ago
It would be very interesting to have another discussion after the "new" Datamart feature has been implemented in production with a few enterprise customers. It seems like the Datamart could be an amazing tool for handling multiple large data sources.
@NeumsFor9 2 years ago
Chris Adamson's Mastering Data Warehouse Aggregates is a good reference for solving these issues, but Spark is also not a bad alternative in some cases.
@zecosta9579 3 years ago
What was that term brought up a couple of times, "filter dags" (e.g. around the 49:00 mark)? I have not heard it before and I don't think I even understood the word correctly. Many thanks for the great video, folks; it gave me a lot to think about when developing my next reports :)
@SQLBI 3 years ago
It's "filtered aggs", meaning "filtered aggregations": it's a specific optimization technique for SQL Server (materialized views, filtered indexes, ...) that could improve performance in a DirectQuery scenario.
@franciscoclaudio4818 3 years ago
Here in Brazil, analysts use Pentaho a lot for big data processing. Would you recommend a better tool?
@AnandDwivedi 3 years ago
Marco, Alberto: expecting a video on XMLA endpoints and incremental refresh.
@tejatarigopula3266 2 years ago
Thank you for sharing real-time solutions for real-world problems. Adam was referring to the dictionary of a column. Can you please elaborate on this? Thanks.
@SQLBI 2 years ago
Read this: www.sqlbi.com/articles/measuring-the-dictionary-size-of-a-column-correctly/
@anselmolopez8584 3 years ago
Congratulations on the knowledge you share. I need it and I will follow you... from Spain.
@bhaveshpatelaus 3 years ago
We are talking about using DirectQuery for large tables. However, we need to remember that no more than 1 million rows are supported in a query: a visual is going to throw an error if its query returns more than 1 million rows. We need more settings to control how DirectQuery behaves with the source.
@SQLBI 3 years ago
The limitation is related to the temporary tables moved to the formula engine. You can have 10 billion rows in DirectQuery, but you have to be aware of the limitation depending on the queries. Very generally speaking, dimension size could be an issue in a limited relationship, but there are many other elements in play. But yes, this limit is definitely something we'd like to be able to customize: it is customizable in Analysis Services, not in Power BI (yet).
@anilkumar-rg9jk 2 years ago
Do more on DAX like this, guys!
@paradeen2730 3 years ago
I heard you talking about text search in large dimensions. At least the filter visuals I have tested couldn't fold into a SQL query that uses a SQL full-text index search. Is this your experience too?
@SQLBI 3 years ago
Yes, the best we've seen is using LIKE. However, *depending on the hardware*, the same search could be faster on a relational database.
@kananparmar9656 2 years ago
Hi Adam Saxton, Patrick LeBlanc, Alberto Ferrari, and Marco Russo. I am new to Power BI, and I'm trying to bring on-premises data into the Azure cloud and then visualize it through Power BI. I'm struggling with what Azure architecture to develop to fetch data from the POS, which Azure components should be used to get the project going, and how the data modeling and relationship knowledge I have gained here can be applied.
@rerangelt 3 years ago
This was a deluxe video.
@terryliu3635 3 years ago
I like the part about the "no cached" mode instead of "direct query". Lol.
@hojatalaii8360 6 months ago
How do you calculate the maximum number of consecutive negative values in a column with DAX?
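[Editor's note: this question was not answered in the thread. One possible approach, shown only as a rough, unoptimized sketch: assuming a hypothetical Data table with a Data[Value] column and a unique Data[Date] column that defines the order of the rows, rank the rows by date and, for each negative row, measure the distance to the most recent non-negative row.]

    Max Consecutive Negatives :=
    VAR Ordered =
        ADDCOLUMNS (
            ALLSELECTED ( Data ),
            "@Idx", RANKX ( ALLSELECTED ( Data ), Data[Date],, ASC, DENSE )
        )
    VAR NegativeStreaks =
        ADDCOLUMNS (
            FILTER ( Ordered, Data[Value] < 0 ),
            "@Streak",
                -- length of the negative run ending at the current row
                VAR CurrentIdx = [@Idx]
                VAR LastNonNegative =
                    MAXX (
                        FILTER ( Ordered, [@Idx] < CurrentIdx && Data[Value] >= 0 ),
                        [@Idx]
                    )
                RETURN CurrentIdx - COALESCE ( LastNonNegative, 0 )
        )
    RETURN
        MAXX ( NegativeStreaks, [@Streak] )

The measure returns blank when no negative value is visible; on a large table the nested scan would be slow, so the streak would be better precomputed as a column during data preparation.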
@bhaveshpatelaus 3 years ago
As per the Analysis Services documentation, the cardinality of a column should not be greater than 2 billion. I think this also applies to Power BI. Imagine having a ticket number in a fact table that you need to display in your report when it has more than 2 billion unique values.
@SQLBI 3 years ago
The cardinality can be bigger; the issue is mainly related to performance when the column is used in a relationship. Another scenario is DISTINCTCOUNT, where the cardinality has a big role in performance. Approximate algorithms can be very useful there; take a look at this article: dax.tips/2019/12/20/dax-approx-distinct-count/ . Absolute numbers also depend on the hardware, and over time the "limit" can change because of hardware improvements. In 2012, 1 million was a high cardinality; today 4 million rows could have reasonable performance on dedicated hardware (I would stay below 2 million rows on cloud services, though).
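[Editor's note: as an illustration of the approximate approach mentioned in the reply above (not shown in the video), a minimal sketch assuming a hypothetical Sales[TicketNumber] column. The APPROXIMATEDISTINCTCOUNT function trades a small counting error for better performance and, in DirectQuery, is translated to the approximate count of sources that support it (such as APPROX_COUNT_DISTINCT in SQL-based engines).]

    -- Exact distinct count: cost grows with the cardinality of the column
    Tickets Exact := DISTINCTCOUNT ( Sales[TicketNumber] )

    -- Approximate distinct count: estimated result, much cheaper on
    -- high-cardinality columns over supported DirectQuery sources
    Tickets Approx := APPROXIMATEDISTINCTCOUNT ( Sales[TicketNumber] )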
@silviab5579 3 years ago
❤❤❤❤
@natnaelstesfagiorgis3077 10 months ago
Advice on connecting with SQL
@netflixaccount3210 3 years ago
I laughed so hard at this comment by Kevin Hunt: "Marco, did you take all of Alberto's books?"
@shafa7668 2 years ago
I think the issue is that business requirements change over time, and the existing model is expected to be able to answer future questions. That's why it feels safe to keep as many columns as available. So if you have narrow fact tables, there is a risk that in the future your model will not serve the purpose.
@richard-social8125 3 years ago
Summary?
@Karenshow 4 months ago
What about telematics data? That gets big pretty fast.
@VladMZ 3 years ago
Oracle's OBIEE enforces star schemas religiously. You simply think thrice before adding another attribute to your fact table, let alone the fact that you CANNOT create standalone fact tables without any dimensions.
@rcb3921 3 years ago
"We have a small youtube channel"
@NeumsFor9 2 years ago
In companies with high turnover, it can be easier to start over than to try to decompose what other teams implemented. If you can't decompose it in a week, chances are you're better off starting from scratch. The big companies are actually the worst offenders.
@gondebas 3 years ago
Wooow! Greetings! I have a question for you, big guys, on how you feel about calculated columns in a fact table. Whenever the fact table gets fat enough, say 1 GB, the star schema does not shine anymore. What works for me in terms of performance and speeding up reports is to add calculated columns to the fact table. It is effective not only when there is complexity in the measure: I notice the speed jump just by staying away from a function as simple as SUMX to calculate sales value. Adding an extra calculated column to the fact table increases the total size of the report negligibly, so it is a relatively low cost. Even replacing a SUMX measure with a calculated sales-value column plus a measure using a simple SUM increases the speed approximately five times. If the measure contains any grain of complexity, say an IF function, the speed increases manifold. Am I missing something here?
@SQLBI 3 years ago
Details are important. If you have an IF, the calculated column is faster, but it would be better to compute the column in the data source or in Power Query, because a calculated column in a large table is not a good idea for other reasons (lower compression, higher memory consumption at refresh time, longer processing time). If you have a simple expression that can be pushed to the storage engine (like Quantity * Price using two columns of the same table), you usually don't see any performance advantage at query time from the calculated column.
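[Editor's note: a minimal sketch of the trade-off described in the reply above, using hypothetical Sales[Quantity], Sales[Price] and Sales[IsReturn] columns; this is not code from the video.]

    -- Simple product: the multiplication is resolved by the storage engine,
    -- so a calculated column usually adds no query-time benefit
    Sales Amount := SUMX ( Sales, Sales[Quantity] * Sales[Price] )

    -- Conditional logic inside the iterator is more expensive on a large
    -- fact table...
    Net Amount Measure :=
    SUMX (
        Sales,
        IF ( Sales[IsReturn], 0, Sales[Quantity] * Sales[Price] )
    )

    -- ...so the logic can be materialized once per row instead. Per the
    -- reply above, prefer computing it in the data source or Power Query
    -- rather than as a DAX calculated column, for compression and refresh
    -- reasons.
    -- Calculated column on the Sales table:
    NetAmount = IF ( Sales[IsReturn], 0, Sales[Quantity] * Sales[Price] )
    -- Measure that aggregates the precomputed column:
    Net Amount From Column := SUM ( Sales[NetAmount] )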
@timianalytics7150 1 year ago
What's that on Patrick's chest😂
@sopauljauck4163 1 day ago
I'm dealing with a view in our data warehouse that has 57 columns and 225 million rows. I also have another table with 3 billion rows. My refreshes drain the life energy out of me because they take so long. I know it's not a good idea to have everything in a single view/table. I'm here trying to find something to show the company that what we have is a bad approach. Can anybody out there give me a clean, concise summary that I can pass on to our data warehouse team?
@SQLBI 19 hours ago
Implement an incremental refresh. When you go camping you set up your house every day, but you don't bring the furniture. Your historical data is like the contents of the furniture: you would not relocate it every day, right?
@lanasedayanoch 3 years ago
PATRICK, could you be more specific? Which BI tools encourage flat tables? How come you have learnt that only last week - you've been in the game long enough. Also, you say you have LEARNT that they "ALL HAVE RUN INTO THE SAME PERFORMANCE PROBLEMS". That implies a hell of a lot of empirical evidence, all within "last week", and we all know how many data points one must have for each of those to make a valid conclusion. Evidence, please. This is too far-fetched a statement.