Nice one! At how much data processing do you recommend to start working on optimizing the queries? As in when will the effect really start to show an impact? 1gb of processing, 10, 1000 or 1tb?
That's a great question! I think it depends on your pricing model. For example, if you're paying on-demand for each query you may be inclined to start optimizing queries that are scanning whatever #GB translates to a reasonable cost savings. This will also depend on how often the query is being run - for example if a query scanning 1TB of data is only run once that's $5, but if you run it everyday that's a lot! I would try to pinpoint queries that are run on a regular basis and are consuming the most resources, but set some threshold based on the total estimated ROI. For example, (the number of times the query will be run each year) x (the total cost of the query) > $1000 (or whatever is significant to your business) Check out the pricing calculator if you're estimating query prices: cloud.google.com/products/calculator/
How do you know the "where" clause is performed before the join (5:17) ? Don't you have to do a sub query on the left joined (t2) table ? (also does this sub query be less performant ?)
You should be able to tell from the execution details :) In the first stage of the query, the workers will read from distributed storage and filter the data before shuffling and joining (which usually happens in the next stage). You can see the WHERE clause and the JOIN applied in the details of the stage to understand the order. You don't need the where clause in a subquery to filter the data before joining, but subqueries can be helpful if you want to pre-aggregate the data before joining to reduce the amount of data joined.
@@leighajarett221 so I don't need a where clause in subquery because everytime the where clause is being done at the distributed level ? Like everytime ? Or should I put a subquery to be sure ? (and if I put this subquery to be sure will I get bader performances ?)
Some these concepts could apply to other SQL database management systems (DBMS). Also some DBMS' are smart enough to apply some of these optimizations when generating the execution plan without having the user making changes to their SQL query.