You don't always need JOINs

Views: 118K

To learn more about PlanetScale, head to planetscale.co...!
You don't always need joins, and sometimes it's faster to not use them 🤯
💬 Follow PlanetScale on social media
• Twitter: / planetscale
• Discord: / discord
• TikTok: / planetscale
• Twitch: / planetscale
• LinkedIn: / planetscale

Published: 1 Oct 2024

Comments: 391

@anykeyh 10 months ago

Subqueries and inner joins are often decomposed the same way by the query planner. In your first example you do a left join, not an inner join, which is more difficult for the planner to optimize. I'm not too knowledgeable about MySQL, but on PostgreSQL I can tell you that a subquery or a join will behave the same way. Also, splitting the query into two queries is not much slower than using a subquery in this case; you just add the query parsing and the network I/O, which is a fixed cost overall. In the end, the subquery IDs will be used with Hash or BitmapScan strategies, which are the same strategies used if you run two queries. SQL optimization is a rabbit hole; beware, it gets very deep very quickly :)

@MattiasFlodin 10 months ago

Inner joins are more difficult to optimize because the search space of possible execution plans is greater. But the availability of more options also means that the best possible plan is potentially better than with a left join. So if you're in a situation where both will work semantically, it makes sense to profile both variants.

@undefined6341 10 months ago

That's all nice and fine, until you start running into optimizer bugs. The video should have started with a caveat about such possibilities.

@aliren6118 10 months ago

Exactly. Just write idiomatic SQL and let the internal execution plan take care of the optimizations. No need to be super clever when it comes to SQL. If the query is "par for the course" the engine is coded to optimize your stuff.

@spacemanmat 10 months ago

Please don’t break up queries into separate pieces if at all possible. I'm currently working on 2 performance issues where people thought pulling back the records and doing in-memory checking was fine. It worked for a few years until it mysteriously started failing at random. No one knew why, and of course they could never get it to fail in development. I tracked it down and figured out that in production the query brings back 40k records in one case; it turns out it only needed one of them.
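
A minimal sketch of the pattern described above, using Python's built-in sqlite3 module rather than MySQL. The `records` table, its `status` column, and the 40,000-row fill are assumptions made up for illustration; the comment does not describe the real schema.

```python
import sqlite3

# Hypothetical table and data, invented for this illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, status TEXT)")
conn.executemany(
    "INSERT INTO records (id, status) VALUES (?, ?)",
    [(i, "target" if i == 31337 else "other") for i in range(1, 40001)],
)

# Anti-pattern: pull every row back, then check in application memory.
all_rows = conn.execute("SELECT id, status FROM records").fetchall()
in_memory_match = [row for row in all_rows if row[1] == "target"]

# Alternative: let the database filter and return only what is needed.
db_match = conn.execute(
    "SELECT id, status FROM records WHERE status = ?", ("target",)
).fetchall()

print(len(all_rows), "rows transferred for the in-memory check")
print(len(db_match), "row(s) transferred when filtering in SQL")
```

Both approaches find the same record, but the first one moves the whole table across the connection to do it, which is the kind of cost that tends to stay hidden on a small development dataset.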

@artephank 10 months ago

@MattiasFlodin Is it really? If you are joining a smaller table and the join condition uses the same parameters you would use to filter the subquery, what is the difference really? My totally anecdotal evidence from working with Oracle DB tells a different story: subqueries, if anything, are usually slower.

@GameDesignerJDG 10 months ago

If this is definitively the better option, why wouldn't the MySQL dev team just add the same optimizations here to the JOIN clause? This is a wonderful piece of advice, but I'm walking away from this more confused than enlightened.

@Alex-lu4po 10 months ago

👀

@qub1750ul 10 months ago

The video is a bit clickbait, it gets more honest in the end. The pattern presented isn't necessarily better, it really depends on the data at hand. Unless the query actually computes and throws away a lot of data the join version would be more optimizer-friendly and thus better in general.

@joshr96 10 months ago

Yeah at the very end he hits on the main reason to use a subquery over a join. You want to filter on data from another table but don't need it in your final result set. (In this video that would be a subset of users based on the posts data, but we don't care at all about any columns in posts for our output from the query).

@minilathemayhem 10 months ago

The optimizations that apply to IN and EXISTS aren't the same optimizations that apply to JOIN, mostly. The JOIN itself is doing something completely different from IN and EXISTS, and that "completely different" thing is why the "duplicate" records exist. So, one of the things that's adding overhead to the JOIN over the IN and EXISTS queries is the DISTINCT keyword. It's telling MySQL it has to go through the result set, compare the records, and only return a single copy of any record that's repeated multiple times. The IN and EXISTS queries don't have to do this at all.
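
The difference described above can be sketched with a tiny example. This uses Python's built-in sqlite3 module rather than MySQL, and the `users`/`posts` tables and their rows are assumptions modeled loosely on the video's example, so it only shows the shape of the result sets, not MySQL's performance.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema and rows, invented for this illustration.
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE posts (id INTEGER PRIMARY KEY, user_id INTEGER, views INTEGER);
    INSERT INTO users VALUES (1, 'ann'), (2, 'bob'), (3, 'cy'), (4, 'dee');
    INSERT INTO posts VALUES
        (1, 1, 95000), (2, 1, 99000),  -- user 1 has two popular posts
        (3, 2, 10),                    -- user 2 has one unpopular post
        (4, 3, 91000);                 -- user 3 has one popular post
""")

def ids(sql):
    """Run a query and return the first column of every row as a list."""
    return [row[0] for row in conn.execute(sql)]

# A plain join repeats user 1 once per matching post.
join_raw = ids("""
    SELECT users.id FROM users
    JOIN posts ON posts.user_id = users.id
    WHERE posts.views > 90000 ORDER BY users.id
""")

# DISTINCT removes the repeats after the join has produced them.
join_distinct = ids("""
    SELECT DISTINCT users.id FROM users
    JOIN posts ON posts.user_id = users.id
    WHERE posts.views > 90000 ORDER BY users.id
""")

# IN and EXISTS never produce the repeats in the first place.
with_in = ids("""
    SELECT id FROM users
    WHERE id IN (SELECT user_id FROM posts WHERE views > 90000)
    ORDER BY id
""")
with_exists = ids("""
    SELECT id FROM users
    WHERE EXISTS (SELECT 1 FROM posts
                  WHERE posts.user_id = users.id AND posts.views > 90000)
    ORDER BY id
""")

print(join_raw, join_distinct, with_in, with_exists)
```

The join returns user 1 twice until DISTINCT is applied, while the IN and EXISTS forms return each user once without any deduplication step.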

@omar_5622 10 months ago

I think the message here is to be open to breaking free from the default approach. It's always good to experiment with and benchmark more than one query before deciding what to commit to the production code. I usually do that whenever the query time is a concern, and in some cases the query that proved to be better was the least expected.

@gilbes1139 10 months ago

If MySQL is not producing the same execution plan for the IN and EXISTS queries, there is something seriously wrong with it.

@_MB_93 10 months ago

Not just MySQL; other engines like MSSQL also differ between IN and EXISTS. IN can't deal with NULLs, so always use EXISTS when filtering from another table to avoid confusion.

@proosee 10 months ago

@_MB_93 SQL is declarative, so it should produce the same execution plan, because there is literally no case where this should produce different results given that IDs can't be null (and the RDBMS knows it). Although, if the column under discussion is nullable, then it can produce a different query plan, and that's totally understandable.

@arirahikkala 10 months ago

In my experience they're quite often different in PostgreSQL, for good or ill. Usually EXISTS is the correct choice. I have seen one case in my career so far where IN happened to not just mean the right thing but perform better. The query was a huge complex mess with more than one bit throwing the query planner for a loop, so I'm just going to call it a curious exception to the rule.

@robsonfaxas6574 10 months ago

Bro, I once replaced IN with EXISTS in a 2-hour-long query and reduced it to 6 seconds. His explanation shows why EXISTS can sometimes be faster than IN. However, if you have a long FIXED LIST returned, IN can perform better than EXISTS, because it creates the list ONCE and reuses it. EXISTS, on the other hand, will always run row by row when it is correlated with the ID of one or more tables from the outer query.

@jailbaitingjayb 10 months ago

@proosee Nonsensical statement. Barring bugs, changing the execution plan does not change the result, and any single statement can have anywhere from one to a near-infinite number of query plans.

@Raftor74 10 months ago

As far as I know, the WHERE IN construct has a limit of 65536 values (the exact value depends on the specific DBMS). Isn't it better to use INNER JOIN with a subquery inside? There is no such restriction there.

@mme725 10 months ago

I wasn't worried about this, but now I'm curious. This is why I love these videos. Even if I knew something already, the comments also provide interesting insights.

@spicynoodle7419 10 months ago

I think in MySQL `where in` has a limit of 10k items

@ronsijm 10 months ago

@Raftor74 I wonder if that's also true for sub-queries. I've seen the same issue when doing it in a two step approach - pulling the IDs in first, and then constructing an IN query - But I'm not sure about actual "IN sub-query" queries.. @PlanetScale ?

@dipteshchoudhuri 10 months ago

For default postgres, yes it is.

@SXsoft99 10 months ago

Yes, but you can bypass it by encapsulating your `where in` in a grouped `where` and adding `or where in` clauses.

@rysw19 10 months ago

One thing I would add is that the more complex your queries get with multiple or even nested subqueries, the harder it is for the optimizer to follow your code and apply the appropriate indexes. I’ve definitely had situations where I turned a subquery along these lines into a join and it changed the execution plan to find the correct index and run much faster. And it can change on a MySQL version to version basis. In general, the more complex your queries get, you will usually want to fall back to the joins because it doesn’t rely on the optimizer to be clever in order to find the correct indexes.

@po6577 10 months ago

@jonasjonaitis8571 yeah both are worth considering.

@Voidstroyer 10 months ago

@jonasjonaitis8571 This will definitely negatively impact performance since you are going over the wire multiple times. Keep in mind that under the hood, a database query issued from your applications still just sends a tcp request and all of the encoding, packeting, etc that is involved in that.

@boccobadz 10 months ago

In real-world apps (eg telco/location data with billions of rows per day) subqueries fall apart pretty quick. If you're working on toy datasets (

@Voidstroyer 10 months ago

@boccobadz Totally agree. And given that each method returns the required data in less than 1 ms, optimization at this point is a waste of time. I mean, sure, you can say that the left join method was 2x slower than the others, but 0.3 ms vs 0.7 ms is a negligible difference.

@timseguine2 10 months ago

@jonasjonaitis8571 Subqueries are pretty much always going to be faster than running separate queries, since in the worst case the subquery approach will do the exact same thing except not round-trip the useless intermediate results over the network and serialization logic.

10 months ago

I would like to see a benchmark on a large dataset. Optimising for a 100-row dataset is a waste of time, as results may be very different. Depending on the engine, a join may have alignment overhead (partitioning both tables by hashes and sorting), which will increase time on a small dataset and can drastically increase speed on large, sparsely matching datasets.

@Tjommel 10 months ago

I ran into the same situation with invoices and customers (both tables have >10k rows). Subselects are faster and the data overhead is much smaller. Like he said at the end of the video: if you just need data from table A related to table B, do a subselect.

@jordancobb509 10 months ago

Totally agree. If your tables only have 100 records in them then SQL is not even going to use the indexes anyway so it largely doesn't matter how you query the data because it's always going to use a full table scan and be "fast enough".

10 months ago

@Tjommel For me, a large table is something that doesn't fit into a single worker's memory.

10 months ago

Of course, if you work with time-critical applications, then you will try to squeeze every millisecond out of each stage of your app.

@complexacious 10 months ago

I just did a quick and dirty test on a table with 100,000 rows matching against a subselect of 1.5mil rows and the join is ~1.15seconds on average and the exists is... 3.4 seconds on average. A group by instead of distinct gives me the same 1.15seconds. I think the moral of the story is don't listen to blanket advice from someone selling you services billed by the query, profile your code and find out which is faster for you and if you've got time see if it's faster because of a mistake that's fixable.
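
A rough sketch of the "profile it yourself" advice above, using Python's built-in sqlite3 module and `time.perf_counter`. Everything here is an assumption for illustration: the schema, the synthetic data, the index, and the `time_query` helper are hypothetical, and timings from SQLite in memory say nothing about MySQL or about your own data.

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
# Hypothetical schema, index, and deterministic synthetic data.
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE posts (id INTEGER PRIMARY KEY, user_id INTEGER, views INTEGER);
    CREATE INDEX posts_user_views ON posts (user_id, views);
""")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [(i, f"user{i}") for i in range(1, 2001)],
)
conn.executemany(
    "INSERT INTO posts VALUES (?, ?, ?)",
    [(j, (j % 2000) + 1, (j * 7919) % 100000) for j in range(1, 20001)],
)

QUERIES = {
    "join + distinct": """
        SELECT DISTINCT users.id FROM users
        JOIN posts ON posts.user_id = users.id
        WHERE posts.views > 90000 ORDER BY users.id""",
    "in subquery": """
        SELECT id FROM users
        WHERE id IN (SELECT user_id FROM posts WHERE views > 90000)
        ORDER BY id""",
    "exists": """
        SELECT id FROM users
        WHERE EXISTS (SELECT 1 FROM posts
                      WHERE posts.user_id = users.id AND posts.views > 90000)
        ORDER BY id""",
}

def time_query(sql, repeats=5):
    """Return (best wall-clock seconds over `repeats` runs, fetched rows)."""
    best = float("inf")
    rows = []
    for _ in range(repeats):
        start = time.perf_counter()
        rows = conn.execute(sql).fetchall()
        best = min(best, time.perf_counter() - start)
    return best, rows

results = {name: time_query(sql) for name, sql in QUERIES.items()}
for name, (seconds, rows) in results.items():
    print(f"{name:>15}: {seconds * 1000:.3f} ms, {len(rows)} rows")
```

The useful part is the habit rather than the numbers: run every candidate against realistic data, confirm they return the same rows, and only then compare timings.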

@IsaacClancy 10 months ago

Alternative title: don't use left join when you want inner join

@aoe4_kachow 10 months ago

1. You want users who have popular posts, so you should use join instead of left join
2. You can add the view conditional to the join ON part
3. You could also get all popular posts and then get the distinct author ids and then join that with the user table

@Pekz00r 10 months ago

Yes, I was thinking 1 & 2 as well. Inner join and moving the conditional to the join would probably be a lot more performant. I'm not sure what you mean in 3. Do you mean to run them as separate queries? He talked about that in the video.

@minilathemayhem 10 months ago

Just curious, why'd you use a left join over an inner join? Shouldn't the inner join be more performant than the left because it's going to knock out any user records that have no posts (assuming you have users in the database with no posts)?

@oOShaoOo 10 months ago

I do agree, an inner join in this case seems more appropriate

@PlanetScale 10 months ago

Yeah inner join would've been more appropriate for the real case, but this sample data wouldn't have changed. Just an oversight!

@lawrencepsteele 10 months ago

This lines up with my approach. I tend to use joins only when I need data from those joined tables, otherwise I use subqueries. And I used (NOT) EXISTS in lieu of joins, initially because they were said to be more efficient. It turned out to be true because I saw significant performance increases when working with multi-million record related tables, at least in Oracle and DB2. I may be retired, but it's still fun working with SQL and figuring out the most efficient and effective ways to pull information from data.

@Serggio123 10 months ago

If those execution plans are different in MySQL, it must be a bad RDBMS. This advice is bad. An adequate database engine uses statistics to make a plan. In fact, it is really hard to write a bad query.

@TimothyWhiteheadzm 10 months ago

This isn't about not needing joins, it's about optimizing a query so as to not have to read all rows. Of course, if you knew this was an important query, you could optimize it further with special columns and indexes. What you ended up doing is really a join, but instead of reading all rows it only finds the first matching row. There are probably even further optimizations you can do.

@trentonsmith8739 10 months ago

This is honestly bad advice, poorly explained, and ignorant of the inner workings of these operators. The fact that the term correlated subquery doesn't even appear in a video mostly about the exists operator says enough about this. Not to even mention the fact that exists is still a join of sorts, you're just not using the join keyword. Just bad

@chrism3790 10 months ago

I'd be careful with this advice. The IN clause usually has size limitations. In some database engines it's as low as 65k values. Not only can it randomly fail if the subquery has an uncertain return size, but you can also pay heavy performance penalties as it gets larger. I'd just use a CTE with DISTINCT, and join to that.

@alexaneals8194 10 months ago

You have to watch out with Distinct, it can eat up your memory on huge tables since it does everything in memory. For extremely large queries, I would just write a proc so that you can take advantage of temp tables to improve the performance especially on large table joins that are used multiple times. However, each database engine will have different performance enhancements that might make one approach more effective on one database, but not on another.

@osamazch 10 months ago

been there. personally i always found subqueries to be a foot gun in most of the situations that i ran into.

@PlanetScale 10 months ago

The IN clause has size limits when you're binding in params, not with a subquery. With a subquery it's a semi-join or an anti-join.
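
The distinction above can be sketched as follows, using Python's built-in sqlite3 module instead of MySQL. The schema, the data, and the `CHUNK` size of 500 are assumptions chosen for illustration; actual bound-parameter limits differ by engine, driver, and build.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema and data, invented for this illustration.
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE posts (id INTEGER PRIMARY KEY, user_id INTEGER, views INTEGER);
""")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [(i, f"user{i}") for i in range(1, 5001)],
)
conn.executemany(
    "INSERT INTO posts VALUES (?, ?, ?)",
    [(i, i, 95000 if i % 2 == 0 else 10) for i in range(1, 5001)],
)

# Two-step approach: fetch the ids, then bind them back in as parameters.
# The bound list is what runs into size limits, so it has to be chunked.
popular_ids = [
    row[0]
    for row in conn.execute(
        "SELECT DISTINCT user_id FROM posts WHERE views > 90000"
    )
]
CHUNK = 500  # assumed safe size; not a documented limit of any engine
two_step = []
for start in range(0, len(popular_ids), CHUNK):
    chunk = popular_ids[start:start + CHUNK]
    placeholders = ", ".join("?" * len(chunk))
    two_step.extend(
        conn.execute(
            f"SELECT id FROM users WHERE id IN ({placeholders})", chunk
        ).fetchall()
    )
two_step.sort()

# One-step approach: the subquery stays inside the database, so no
# parameter list is built and no chunking is needed.
one_step = conn.execute("""
    SELECT id FROM users
    WHERE id IN (SELECT user_id FROM posts WHERE views > 90000)
    ORDER BY id
""").fetchall()

print(len(two_step), len(one_step))
```

Both approaches return the same 2,500 users here; the subquery form just avoids shipping the intermediate id list out of the database and back in.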

@Sk4lli 10 months ago

@PlanetScale good to know, I was actually wondering about it and about to look it up, but now my question has been answered. 😄

@dougnulton 6 months ago

There’s no reason to avoid using WHERE EXISTS. That’s literally its sole purpose: to perform an existence check. It’s preferable to joins in this instance because DISTINCTs can be costly. And similar to joins, WHERE EXISTS can handle comparisons that involve “joining on” multiple columns, unlike IN, which is just comparing 1 column vs 1 column. Additionally, NOT IN can result in some “gotchas” if one of the values can possibly be NULL. So really there is no reason not to use WHERE EXISTS / WHERE NOT EXISTS; it performs as well as, if not better than, joins or IN/NOT IN, and has fewer caveats.
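
The NOT IN "gotcha" mentioned above can be reproduced in a few lines. This sketch uses Python's built-in sqlite3 module, and the two tables and their rows are assumptions invented for the example; the NULL behavior shown follows standard SQL three-valued logic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical data: one post has no author (user_id IS NULL).
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE posts (id INTEGER PRIMARY KEY, user_id INTEGER);
    INSERT INTO users VALUES (1, 'ann'), (2, 'bob'), (3, 'cy');
    INSERT INTO posts VALUES (1, 1), (2, NULL);
""")

# Goal: users with no posts. With a NULL in the subquery result,
# "id NOT IN (1, NULL)" evaluates to NULL (unknown) for users 2 and 3,
# so no row passes the WHERE filter.
not_in = [row[0] for row in conn.execute("""
    SELECT id FROM users
    WHERE id NOT IN (SELECT user_id FROM posts)
    ORDER BY id
""")]

# NOT EXISTS only asks whether a matching row exists, so the NULL row
# never matches and users 2 and 3 are returned as expected.
not_exists = [row[0] for row in conn.execute("""
    SELECT id FROM users
    WHERE NOT EXISTS (SELECT 1 FROM posts WHERE posts.user_id = users.id)
    ORDER BY id
""")]

print("NOT IN:", not_in)
print("NOT EXISTS:", not_exists)
```

Filtering the subquery with `WHERE user_id IS NOT NULL` would also repair the NOT IN version, but NOT EXISTS avoids having to remember that.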

@douglasemsantos 10 months ago

I've actually thought about it before, looking at some queries I could do either join or use a subquery, but I never knew exactly which one to choose, I kind of just followed the pattern of the project. This video clarified it for me! Thank you!

@mckola 10 months ago

Hi, Aaron. What is the point of using a left join instead of an inner join in this query? Would an inner join be much slower compared to a WHERE subquery?

@minilathemayhem 10 months ago

Yeah, I'm a bit confused by that as well. In general, an inner join should perform the same, if not better, than a left join. As well, we're looking for users that *do* have posts, so it makes more sense to use an inner join since that'll ensure only users with posts are actually pulled back. At my previous job, I managed to improve performance of a lot of queries by moving from outer joins to inner joins when they made more sense, and I feel like using an inner join in this example might bring the performance of the join query closer to that of the sub queries. Then, I think about the only real performance hit would be from the "distinct" operation.

@PlanetScale 10 months ago

Yeah that first example would've been better with an inner join, but with this data it wouldn't have changed

@cesararacena 10 months ago

You are both right and wrong at the same time. Everything depends on the DB engine, the structure of the data and even how many tables you need data from. Sometimes what you say is true. Other times a temp table works best, or you simply create real tables. Even the feared WITH clause (CTEs) works better sometimes. Or partitioning. Please stop making videos like your word is gold. Everything depends on the context and that's why people need to REALLY learn. BTW I work with over half a dozen DB systems both on premise and cloud every day, with tens of thousands of tables, most of them hundreds of columns wide with billions of records, and yes, sometimes I use joins.

@Ostap1974 10 months ago

I think subqueries are always worth considering if it allows you to remove deduplication from the query.

@magfal 10 months ago

It can lead to indexes not being used efficiently when you're pulling out a large portion of the table.

@daleryanaldover6545 10 months ago

@magfal Sounds like you need to add indexes; if the columns being queried are not indexed, it would lead to that. The views column in the video is most likely indexed.

@davidmartensson273 10 months ago

@magfal If you only have one matching row in each table, then a join can be better; the same, of course, if you need the data from both tables and use GROUP BY and aggregate data, or want multiple rows. But if the table used in the subquery will not directly contribute to the returned data and is only part of the condition, then a subquery can often be better, since it can work with the tables separately and then match them on the result, which in many cases is a much smaller set of rows, especially if you have limiting conditions on both tables. Modern SQL engines can often optimize this under the hood even with joins. I have not worked with MySQL at that kind of data volume, and even when I did it was over a decade ago, so I do not know how good MySQL is (probably leagues better than back then ;) ), but I know MSSQL Server can rewrite the execution plan quite a lot. I also know that at least MSSQL (and I assume MySQL) can improve query plans that are used multiple times, since once a query has run once or twice the engine can use the actual returned data to judge whether the created plan was the best or whether the data pattern might allow for even more optimizations.

@magfal 10 months ago

@daleryanaldover6545 A subquery used like this can result in a full table scan if a threshold of expected rows is met. I'm not saying not to use the technique, but be aware that your results may vary based on the ratio and shape of your data and indexes. You should A/B test subqueries, materialized CTEs and joining the full table; they all have benefits and drawbacks in different situations and with different SQL engines. Postgres is less prone to give you this problem, but I've experienced it with both MSSQL and MySQL.

@t3hpwninat0r 10 months ago

consider all options and then test them. you don't want to come back to this after 2 years and have to reverse engineer it in order to fix a performance issue. just make sure it's good from the start.

@Maazin5 10 months ago

Is it accurate to say that WHERE EXISTS works better when the table in the outer query is smaller than the table in the inner query (i.e. table A has fewer rows than table B)? If you were trying to do the inverse (e.g. select posts where the user was created on a certain date) , would a join perform better?

@reachrishav 10 months ago

I'm interested to know this too

@minilathemayhem 10 months ago

While what you're asking does make sense (if table A has more records than table B, then it must take less time to execute something against table A than B), it doesn't necessarily work that way in a database. In a database, you're usually performing filters against indexed fields, which essentially makes the query only have to read a small subset of the entire table. Basically, to really see how it's going to perform in your case, you really need to either execute both queries and see which is faster, or run an explain statement to see what the query optimizer ends up doing with the query.
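
A small sketch of the "run an explain statement" step suggested above. It uses SQLite's `EXPLAIN QUERY PLAN` through Python's built-in sqlite3 module; MySQL has its own `EXPLAIN` with a different output format. The schema, the composite index, and the `explain` helper are assumptions made up for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema plus an assumed index on the filtered columns.
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE posts (id INTEGER PRIMARY KEY, user_id INTEGER, views INTEGER);
    CREATE INDEX posts_user_views ON posts (user_id, views);
""")

def explain(sql):
    """Return the human-readable detail column of SQLite's query plan."""
    # The detail text is the last column of each EXPLAIN QUERY PLAN row.
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

join_plan = explain("""
    SELECT DISTINCT users.id FROM users
    JOIN posts ON posts.user_id = users.id
    WHERE posts.views > 90000
""")
exists_plan = explain("""
    SELECT id FROM users
    WHERE EXISTS (SELECT 1 FROM posts
                  WHERE posts.user_id = users.id AND posts.views > 90000)
""")

print("JOIN + DISTINCT plan:")
for step in join_plan:
    print("  ", step)
print("EXISTS plan:")
for step in exists_plan:
    print("  ", step)
```

The exact wording of the plan depends on the engine and its version, so the output is something to read and compare, not to rely on in code.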

@erichkitzmueller 10 months ago

This depends on the scale of the difference and also on the number of matching rows. For example, let's say you have just 10 "users" and a million "posts", and an index on "posts" that lets us immediately find a post for a specific user with more than x "views". The "where exists" query just accesses the "posts" table through this index 10 times (once for each user) and completely ignores the remaining 999990 posts, whereas a join or an "and id in () ..." subquery probably results in accessing way more posts along the way. (Of course, there is always a chance that a super clever query optimizer in your database recognizes this case and switches to the faster access path.) On the other hand, if you have 1000 users with a total of 1010 posts, and only 5 of those posts have the required number of views, filtering posts first and (semi-)joining to users later is probably faster than accessing posts 1000 times, once for each user separately.

@KristjanB 10 months ago

I think this just shows that MySQL is not very optimized. The join should be just as fast. I mean, the reason we use SQL instead of manually retrieving data in a specific order is to let the underlying data engine figure out the most efficient way of retrieving the data we want.

@gilbes1139 10 months ago

The JOIN will not be just as fast because it has to compute additional rows in the results to accommodate the intersection with the right table. The IN/EXISTS has less work to do because it knows the results will only be some subset of a single set.

@minilathemayhem 10 months ago

I wouldn't say it's that "MySQL is not very optimized". It makes sense because the queries, while producing the same result, aren't doing the same thing. For example, in the case of the join query, you have to perform an extra step, "distinct". "distinct" in every SQL database I've used is an expensive operation. The subquery method doesn't have to perform a "distinct" because it isn't returning back "duplicate" records. I'd imagine at least a decent bit of the join's overhead in this case is due to the distinct vs the join operation.

@KristjanB 10 months ago

@minilathemayhem Well, the SQL engine should have used the information that the final results are distinct rows from the user table and skipped any joins on already matched users. Any well-written SQL engine should have all the information to make predictive analysis of how the joins and matches are performed. If the query were very complex, then it's possible that distinct would be expensive, but when it's simply the whole row of a table, it should have been able to optimize the join and skip already matched results, similar to how the subquery works.

@minilathemayhem 10 months ago

@KristjanB Every SQL implementation I've used (Postgres, SQL Server, MySQL) has had performance issues when performing distinct, so I don't know what to tell you beyond what I've experienced myself.

@Kaltinril 10 months ago

Nice trick, and it can help. But there may be an underlying reason that it's performing worse. I'd like to see the # of rows in each table, the table definitions, the keys, the explain plans on each query, the last time statistics were run to really see what MySQL is doing.

@HMan2828 10 months ago

You should always use a join if the keys you are using to join the tables have a matching index on said tables... A join on properly indexed tables will always be more performant and efficient. That's the whole point of a relational DBMS... The only way the subquery will be faster is if there is no index on your join keys.

@daleryanaldover6545 10 months ago

Wow this is new, I assumed the views column here is indexed. Also the ids would definitely be indexed.

@proosee 10 months ago

You are oversimplifying things so much.

@alzamer88 10 months ago

an issue he mentioned here is the duplication and that is why he went from join to subquery. if you use a join in this example, you will duplicate records that you will later discard.

@dwiatmika9563 10 months ago

not really

@spacemanmat 10 months ago

No, in all of the cases here the database will join the tables. If it didn’t, it would not be faster.

@yesnickcarter 10 months ago

i don’t like the new trend in tech videos of creating interesting and valuable videos, and giving them bullshit clickbait titles.

@timseguine2 10 months ago

I tend to write joins by default because subqueries used to almost always be slower, and so my brain mostly thinks about database queries in terms of joins now. But in any case subqueries are always preferable to manually doing n+1 queries. My usual advice for programming in general applies though I think: write whichever is the clearest expression of intent first, and then optimize it only if necessary.

@Nworthholf 10 months ago

My rule of thumb is inner join > CTE (sql server)/subquery (non-sqlserver) > outer join

@yaroslavpanych2067 10 months ago

And here I'm expecting some new/secret SQL operation/technique... while he just explains how exactly task should be done at all.

@cherkim 10 months ago

How does this work with left joins? When I don’t want records that don’t exist in table B?

@JuddMan03 10 months ago

That's just a normal join.

@neparkiraj 10 months ago

Isn't a WITH statement better than nested selects? WITH x AS (SELECT... ) SELECT * FROM x WHERE.....
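
To make the question above concrete, here is a sketch that writes the same filter once as a CTE and once as a nested subselect. It uses Python's built-in sqlite3 module, and the schema, rows, and the `popular_authors` name are assumptions for illustration. It only shows that the two forms return the same rows; whether one is faster depends on the engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema and rows, invented for this illustration.
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE posts (id INTEGER PRIMARY KEY, user_id INTEGER, views INTEGER);
    INSERT INTO users VALUES (1, 'ann'), (2, 'bob'), (3, 'cy');
    INSERT INTO posts VALUES (1, 1, 95000), (2, 1, 99000), (3, 2, 10), (4, 3, 91000);
""")

# CTE form: the subquery gets a name and is defined up front.
cte_rows = conn.execute("""
    WITH popular_authors AS (
        SELECT DISTINCT user_id FROM posts WHERE views > 90000
    )
    SELECT users.id, users.name
    FROM users
    JOIN popular_authors ON popular_authors.user_id = users.id
    ORDER BY users.id
""").fetchall()

# Nested form: the same subquery written inline in the WHERE clause.
nested_rows = conn.execute("""
    SELECT id, name FROM users
    WHERE id IN (SELECT user_id FROM posts WHERE views > 90000)
    ORDER BY id
""").fetchall()

print(cte_rows)
print(nested_rows)
```

The CTE mainly buys readability, since the named block can be read on its own and referenced more than once in the same statement.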

@dertechl6628 10 months ago

good question

@spicynoodle7419 10 months ago

Hey Aaron, since PHP8 you can indent your docstrings. You don't need to glue the body of the docstring to the first column in your editor. The body will be indented automatically based on the indentation of the variable declaration!

@FGj-xj7rd 10 months ago

I am so thankful that PHP 8 added this. Makes writing raw queries so much cleaner and better. Old syntax was horrible.

@Pekz00r 10 months ago

Ah, thanks. I didn't know that.

@fairphoneuser9009 10 months ago

Well. It's "You don't need JOINs in cases where a JOIN is just the wrong choice"...

@PlanetScale 10 months ago

Nice, nailed it

@robertminardi4268 10 months ago

Just like anything else, the answer is more nuanced than that. If you try this with 3 or more tables, and you need columns from the innermost subquery table, you end up falling into a nested query hell where you have to bubble up columns from the innermost query to the outermost query. Also, not sure why you used a left join because, by definition, it should be slower than an inner join for the same reason the subquery is faster in your example. A left join returns rows you don't need, while an inner join returns only what you need. Performance aside, just the readability issues with nesting more than a couple of queries might outweigh the slight performance benefits. When I started working for my current company, we used a lot of subqueries and they resulted in these giant unreadable masses, with alias references for each level of nesting. BUT it's good to think of queries in a different way, and having a reduction-of-data mindset from the beginning is certainly a good pursuit.

@arturomedina2055 10 months ago

What is the program he's using to write the queries?

@Voidstroyer 10 months ago

The good stuff: Props for mentioning that this is only useful when you only need data that comes from 1 table. If, for example, you need the users and the posts which satisfy the condition of more than 90000 views, then you would have to use a join. The not so good stuff: 1. This dataset is really small. It's kinda similar to comparing a JavaScript array to a JavaScript object. Depending on the size of the data, one will perform better than the other. 2. Based on the condition, an inner join would have been better since we don't care about users that don't have any posts at all. The condition could also have been placed on the join.

@boccobadz 10 months ago

In the real world, you rarely need data from 1 table. That's the whole reason for using databases in the first place; otherwise using flat files with filters would be the way to go. Not only is the benchmark done on a toy dataset, the query isn't even written the way you would write it (left vs inner, condition placement, etc). Kinda clickbaity.

@Isr5d 10 months ago

With large sets, EXISTS will be more performant than IN; however, with small sets both IN and EXISTS will have the same performance, and sometimes IN will outperform EXISTS. The reason that you should consider EXISTS over IN is that IN will select all matching results and then compare, unlike EXISTS, which returns a boolean. Always use Explain Plan to help you decide which query optimizations are needed. As for the subquery part, it only matters if it's applied within the correct execution order of the query. For example, if you want the subquery to load first and then the main table, you need to switch sides. If you are expecting just one value from a subquery, then use it within SELECT, and so on. You will need to review the query and try moving the subquery around to see which version is best. Finally, understanding how the query's execution order and plan work will save a lot of effort and lead to better decisions.

@TheKevinbigfoot 10 months ago

From the beginning I was sitting here yelling subquery! Because the DB I work with is so large, I have had to play around with queries to increase their efficiency, and using this trick to subquery the ids has reduced the cost of my queries by 10x. Super essential, and powerful when used correctly.

@po6577 10 months ago

the time has come!

@Ezechielpitau 10 months ago

These are actually faster? I assume because we're not joining over data we'll throw away later anyway?

@sphinx00 10 months ago

If the DBMS optimized joins with sideways information passing, one would have gotten the same benchmark results. SQL is a declarative language, and one should not have to bother with such tricks to get optimal performance. It's up to the engine to do the right thing. Too bad MySQL does not do it.

@zh0r1k 10 months ago

This is a perfect video and example mate, this is exactly the same in PostgreSQL. Sometimes you look at the Explain Query and you don't realize that you can swap the JOIN for a Subquery and get 100x the performance out of it. Once, on a big dataset with joins we had ~15-30s execution time vs using a Subquery ended up taking only 250-500ms. It was related to a large work memory usage by producing a really large hash table join.

@darrennew8211 10 months ago

The real answer is "learn SQL, and learn how to check what the optimizer is doing."

@Zomp420 10 months ago

Doesn't MySQL have CTEs? That would be more readable than a subquery, but it's pretty much the same thing.

@Zomp420 10 months ago

To clarify your pattern for when to use a join versus a subquery: it boils down to determining whether table B is one-to-many and whether that is what you want your results to look like. When you want one-to-one, you use a subquery. It comes down to knowing the relationships between your tables and what you need the end result to look like.

@ischmitty 10 months ago

Great video! I’m curious why you used a left join instead of inner join in this case?

@xO_ADX_Ox 10 months ago

What code editor is that?

@jlimas 10 months ago

tableplus

@leroymilo 10 months ago

I would do this (I don't work with MySQL specifically): SELECT * FROM users as u JOIN ( SELECT DISTINCT user_id AS id FROM posts WHERE views > 90000 ) AS p USING(id) Any reason why not to do this? There's little chance I'll change my mind.

@parihar-shashwat 10 months ago

Actually, Laravel's Eloquent ORM generates this kind of query quite often, from what I see in Debugbar.

@MisterVcc 10 months ago

PROBABLY faster... MAY BE more performant... POTENTIALLY work with less data... So, when are you running for office?

@Lycanite 10 months ago

I've always only joined when I want that data; otherwise I use the crap out of subqueries, and everything performs great.

@darrennew8211 10 months ago

This isn't "You don't need joins." This is "different SQL expressions for the same relational operation that the current version of MySQL may optimize differently, but we don't know because we never actually show what the optimizer decides."

@viniciusqueiroz2713 10 months ago

PLEASE use CTEs instead of subqueries! They make your code WAY more readable, and should have the same performance as subqueries. In fact, they might even be faster when you are reusing the same subquery in several different places in your query.

@bordeux 10 months ago

I don't fully agree with you. It really depends on the context of what you are trying to do. Needing to list the users who have 9k views is an edge case. In 99% of cases you are doing a ranking, so you need to sort by views, and that is not possible with a subquery. Also, a subquery with IN/EXISTS does not scale with the size of the data. As the dataset grows, each subsequent paginated result gets slower and slower. If you have 5 million users who meet your criteria, the last page of the pagination will take forever. I recommend always testing on a really huge dataset, because it will show the truth about the hidden costs of the queries :) Could you share your scenario on some SQL fiddle/DB fiddle? Then I can show you the proof.

@Terrados1337 10 months ago

Why must SQL always be the screeching kid in the car? Comparing it to any other programming language it looks like someone threw the syntax in a blender and stopped it when the garbled mess stopped crying. If C was like SQL it would be using "mayhappenstance" instead of "if" and there were only do-while loops.

@Ravenwish1990 10 months ago

Left joins are not great in general with queries of this kind... And I'm not sure about MySQL, but from experience with Oracle DB, IN with subqueries tends to be pretty bad.

@Pinkeseinhorn 10 months ago

Why not join with the subselect? Running Oracle 12 with a dataset of 7 million entries joined on 14 million, I had about a 50% ± 20% speedup on my benchmarks. A 40% range is huge, I know, but it was a shared DB.

@user-fed-yum 10 months ago

Saying that we don't need joins is one of the stupidest things I've heard this week. Time always proves draconian black and white rules as naive and significantly flawed. Worse is how this form of look-at-me spin draws in the inexperienced, potentially significantly impacting their personal growth. Maybe you need to go back to marketing school. Maybe you need to rethink what you are trying to achieve.

@primaxm8845 10 months ago

"in"?!? Seriously?!? "IN" is too slow in big databases! "Join" is slow too! I'll pick the second option: a single select for the posts and many selects to get the users' data. It is fast, it is simple, and it makes many small queries instead of one big, slow one, which can lock the table!

@RnRoadkills 10 months ago

This is wrong. What to use depends on many things: how you index the tables for best performance, which in turn depends on the usage (selects, updates, etc.), and the overall structure of the whole database and its tables. Remember: when you write a select, you are not telling the SQL server what to do, but what you want. The SQL server will compile your request into what it thinks is the best-performing way to fetch the data you asked for. Just showing a simple select and concluding you should not use joins is just plain wrong.

@sgsfak 10 months ago

Two remarks: 1) If you do a left join and then add a where condition on the right table (e.g. "views > 9000" as shown in the video), then in fact you have an inner join. That is of course the correct join type to use for this problem, because you want only the users with popular posts. 2) Instead of a subquery I would use a join with the filtered users, e.g. with a CTE as shown next: "WITH u AS (SELECT DISTINCT user_id FROM posts WHERE views>9000) SELECT users.* FROM users JOIN u ON u.user_id=users.id; " I feel that this aligns better with my mental model of how the data should be processed, and it should also perform well (but of course you need to benchmark on the specific dataset you have :-)
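Both remarks can be checked on a toy dataset. The sketch below uses Python's sqlite3 module as a stand-in for MySQL, with hypothetical tables and rows (user 4 has no posts at all). It compares the left join with a filter on the right table, the plain inner join, and the CTE form quoted in the comment.

```python
import sqlite3

# Hypothetical data; user 4 has no posts, so a LEFT JOIN yields a NULL row for them.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE posts (id INTEGER PRIMARY KEY, user_id INTEGER, views INTEGER);
    INSERT INTO users VALUES (1, 'ann'), (2, 'bob'), (3, 'cy'), (4, 'dee');
    INSERT INTO posts VALUES (1, 1, 9500), (2, 1, 9900), (3, 2, 10), (4, 3, 9100);
""")

# Remark 1: the WHERE on posts.views discards the NULL-extended rows,
# so the LEFT JOIN behaves like an INNER JOIN here.
left_rows = conn.execute("""
    SELECT DISTINCT users.id FROM users
    LEFT JOIN posts ON posts.user_id = users.id
    WHERE posts.views > 9000
    ORDER BY users.id
""").fetchall()

inner_rows = conn.execute("""
    SELECT DISTINCT users.id FROM users
    JOIN posts ON posts.user_id = users.id
    WHERE posts.views > 9000
    ORDER BY users.id
""").fetchall()

# Remark 2: join against the pre-filtered, de-duplicated user ids via a CTE.
cte_rows = conn.execute("""
    WITH u AS (SELECT DISTINCT user_id FROM posts WHERE views > 9000)
    SELECT users.id FROM users
    JOIN u ON u.user_id = users.id
    ORDER BY users.id
""").fetchall()

print(left_rows, inner_rows, cte_rows)
```

The CTE version needs no DISTINCT on the outer query, because the duplicates are removed before the join rather than after it.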

@noredine 10 months ago

"Never use a sub-query" Me remembering the triple nested query I wrote the other day 👀

@chiefcloudarchitect811 10 months ago

But the first query takes 5 milliseconds and the second 7, so the one with EXISTS seems slower?

@ndchunter5516 10 months ago

Now start to compare the execution plans per DB engine for actually equivalent results... So many bad queries are the result of not knowing/specifying what information you want exactly.

@Dawkujacy 10 months ago

I downvoted because I think it's wrong, or rather it's the wrong example. Can you show the execution plans side by side, to show that what you're saying is actually what the DB is doing? Also, if you created indexes on those tables, it should be much faster than the subquery.

@stackercoding2054 10 months ago

When I started to learn SQL I really struggled to understand joins, and I usually ended up writing subqueries exactly like this to avoid them, because I felt that running SQL code you don't understand can scale problems very fast, especially when the data looks fine but in reality it's giving fake results.

@darekmistrz4364 10 months ago

I think a lot of people who use joins don't understand them. So many bugs I've fixed by changing joins to subqueries.

@reed6514 10 months ago

I think it is best to be confused. Keeps it more exciting. Will the database burn? Iii dooont know! Lol 🤣

@bazoo513 10 months ago

For any RDBMS worth its salt these two queries will be equivalent. That's why we use RDBMS, and not raw index-sequential files.

@itsmill3rtime 10 months ago

If it's for a condition, I just use WHERE EXISTS or NOT EXISTS; if I need a piece of data from the other table, then I join.

@theherk 10 months ago

This is such an excellent informative video, but I just cannot support the bait and switch title garbage that entirely undercuts the message. It is so disingenuous and vexing.

@ComfyCosi 10 months ago

Love your videos Aaron! Very high quality content. What do you use to record and edit them?

@orlovskyconsultinggbr2849 10 months ago

Crazy joins are not needed; I've seen so many times people just going mad on joins. Joins always introduce performance hits; I prefer to use prepared views or even trigger functions.

@Hersatz 10 months ago

To think I had to do all of my SQL stuff from a command line console at school when this kind of tool exists. Why are we here just to suffer?...

@anon746912 10 months ago

As a sql dev, this video is hilarious. Well done on the extremely baity thumbnail though. Doing youtube properly.

@etexas 10 months ago

DBA tells me not to use joins, Application lead tells me not to use sub queries. I think they should fight it out themselves instead of using me as a football.

@uuuummm9 10 months ago

I cannot believe someone would join and then do DISTINCT on users.*. This guy invented a problem and recorded a video on how to solve it, plus a clickbait title.

@davidcram5756 10 months ago

Why not use a CTE?

@codepipeline 10 months ago

Maybe you should say MySQL explicitly. Also who is choosing MySQL in 2023?

@duramirez 10 months ago

I very very very rarely use DISTINCT in my query, if I do, it's because there is no alternative.

@chadef555 9 months ago

You mentioned it's a MySQL thing. Would it be the same for MS SQL?

@binaryfire 10 months ago

"According to... the documentation." I see what you did there.

@BagusAndrian 10 months ago

Hi Aaron, CMIIW: instead of creating your own benchmark, IMO the best practice for calculating the cost of a query is the EXPLAIN ANALYZE command. Why aren't you using that command?
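As a rough illustration of inspecting plans rather than timing queries by hand, the sketch below prints the plan for a join and for an IN subquery. It is an analogy only: it uses SQLite's `EXPLAIN QUERY PLAN` through Python's sqlite3 module, not MySQL's `EXPLAIN ANALYZE`, and the schema and the `posts_user_views` index are hypothetical. The plan text differs between engines and versions, so no particular output is implied.

```python
import sqlite3

# Hypothetical schema; an index is added so the planner has something to choose from.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE posts (id INTEGER PRIMARY KEY, user_id INTEGER, views INTEGER);
    CREATE INDEX posts_user_views ON posts (user_id, views);
""")

queries = {
    "join": """SELECT DISTINCT u.id FROM users u
               JOIN posts p ON p.user_id = u.id
               WHERE p.views > 90000""",
    "in": """SELECT u.id FROM users u
             WHERE u.id IN (SELECT user_id FROM posts WHERE views > 90000)""",
}

plans = {}
for label, sql in queries.items():
    # The last column of each EXPLAIN QUERY PLAN row is the human-readable step.
    plans[label] = [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]
    print(label)
    for step in plans[label]:
        print("   ", step)
```

On MySQL itself, prefixing the same statements with `EXPLAIN` or `EXPLAIN ANALYZE` would be the corresponding step.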

@fmkoba 10 months ago

Now I'm curious about the use cases of query decomposition.

@ccgarciab 10 months ago

Yes. In my inexperience, I've always assumed that doing more network requests is worse. It's a heuristic that has served me well enough for now, but it would be cool to see how it might not be always true.
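For anyone unsure what query decomposition looks like in practice, here is a minimal sketch: one query collects the matching user ids, and a second query fetches those users. It runs against an in-memory SQLite database via Python's sqlite3 module, so there is no network round trip to measure, and the schema and rows are hypothetical.

```python
import sqlite3

# Hypothetical schema and data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE posts (id INTEGER PRIMARY KEY, user_id INTEGER, views INTEGER);
    INSERT INTO users VALUES (1, 'ann'), (2, 'bob'), (3, 'cy');
    INSERT INTO posts VALUES (1, 1, 95000), (2, 1, 99000), (3, 2, 10), (4, 3, 91000);
""")

# Query 1: collect the ids of users with at least one popular post.
ids = [row[0] for row in conn.execute(
    "SELECT DISTINCT user_id FROM posts WHERE views > 90000 ORDER BY user_id"
)]

# Query 2: fetch those users, with one bound placeholder per id.
if ids:
    placeholders = ", ".join("?" for _ in ids)
    users = conn.execute(
        f"SELECT id, name FROM users WHERE id IN ({placeholders}) ORDER BY id",
        ids,
    ).fetchall()
else:
    users = []  # skip the second query; an empty IN () list is not valid in every engine

print(users)
```

The trade-off is an extra round trip and an id list held in the application, in exchange for two simple queries that are easy to cache or route separately.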

@lavisharma3210 10 months ago

Thank you planetscale team for creating such videos and more, get to learn a lot

@spacemanmat 10 months ago

Read the title, thought “Ok we’re doing Cartesian Products are we?”

@MrBrannfjell 10 months ago

Fix your title to: SOMETIMES you don't need JOINs. Clickbait Mo flipper :P

@PlanetScale 10 months ago

That's what the asterisk is for 😉

@maximus1172 10 months ago

What IDE is this? Looks amazing

@yakirgb 10 months ago

TablePlus

@i3looi2 10 months ago

JOINS are highly inefficient with scale. Had this issue on a `users` table with 20M rows linked to a `games` table with 100M rows. The fastest way was to use SELECT * FROM users, games WHERE games.user_id = users.id AND (additional filters on games table) Talking about up to x20 performance.

@amrishshah8982 10 months ago

Awesome video

@ColaKingThe1 10 months ago

I liked this video

@jacmkno5019 10 months ago

Use LIMIT 1 to force the cursor to stop after the first match in the subquery. That's the maximum optimization of the subquery version: SELECT u.* FROM users u WHERE (SELECT 1 FROM posts p WHERE p.user_id =u.id AND view_count > 90000 LIMIT 1) IS NOT NULL; The reason a subquery is better than the join here is that you don't care about each post that matches. Without the LIMIT you leave the performance up to the SQL Implementation Gods... Remember MySQL and MariaDB branched out quite a long time ago...
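The `LIMIT 1` scalar-subquery form can be compared with a plain EXISTS on a toy dataset. This sketch uses Python's sqlite3 module rather than MySQL or MariaDB, with hypothetical rows and the `view_count` column name taken from the comment. It demonstrates that the two forms return the same users; whether the LIMIT changes performance is engine-specific and is not measured here.

```python
import sqlite3

# Hypothetical schema and data; user 1 has two popular posts.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE posts (id INTEGER PRIMARY KEY, user_id INTEGER, view_count INTEGER);
    INSERT INTO users VALUES (1, 'ann'), (2, 'bob'), (3, 'cy');
    INSERT INTO posts VALUES (1, 1, 95000), (2, 1, 99000), (3, 2, 10), (4, 3, 91000);
""")

# Scalar subquery capped at one row: NULL when the user has no popular post.
limit_rows = conn.execute("""
    SELECT u.id, u.name FROM users u
    WHERE (SELECT 1 FROM posts p
           WHERE p.user_id = u.id AND p.view_count > 90000
           LIMIT 1) IS NOT NULL
    ORDER BY u.id
""").fetchall()

# The EXISTS form asks the same yes/no question per user.
exists_rows = conn.execute("""
    SELECT u.id, u.name FROM users u
    WHERE EXISTS (SELECT 1 FROM posts p
                  WHERE p.user_id = u.id AND p.view_count > 90000)
    ORDER BY u.id
""").fetchall()

print(limit_rows)
print(exists_rows)
```

User 1 appears once in both results despite having two matching posts, which is the property that removes the need for DISTINCT.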

@cmohanc 10 months ago

subquery is way slower than join. I use subquery where no other option is available.

@atlantic_love 10 months ago

What is squl?

@realoctavian 10 months ago

what about select * from users u, posts p where u.id = p.user_id and p.views > 90000 ?
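One way to answer this question is to run the comma-style query next to an explicit JOIN. The sketch below does that with Python's sqlite3 module on hypothetical data in which user 1 has two popular posts. It suggests the comma syntax is simply an inner join written differently, so it returns one row per matching post rather than one row per user.

```python
import sqlite3

# Hypothetical schema and data; user 1 has two posts above the threshold.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE posts (id INTEGER PRIMARY KEY, user_id INTEGER, views INTEGER);
    INSERT INTO users VALUES (1, 'ann'), (2, 'bob'), (3, 'cy');
    INSERT INTO posts VALUES (1, 1, 95000), (2, 1, 99000), (3, 2, 10), (4, 3, 91000);
""")

# Comma-separated tables with the join condition in WHERE (implicit join).
implicit_rows = conn.execute("""
    SELECT u.id, p.id FROM users u, posts p
    WHERE u.id = p.user_id AND p.views > 90000
    ORDER BY u.id, p.id
""").fetchall()

# The same query written with an explicit INNER JOIN.
explicit_rows = conn.execute("""
    SELECT u.id, p.id FROM users u
    JOIN posts p ON u.id = p.user_id
    WHERE p.views > 90000
    ORDER BY u.id, p.id
""").fetchall()

print(implicit_rows)
print(explicit_rows)
```

Because user 1 shows up twice, this form still needs DISTINCT (or a subquery) when the goal is a list of users rather than a list of user/post pairs.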

@timoconnell6133 10 months ago

Can't you just inner join to the subquery instead of making it an IN statement?

@user-tk2jy8xr8b 10 months ago

Correlated subqueries look so much better than joins, much more natural

@userasd360 10 months ago

How can one communicate with you?

@gfeher123 10 months ago

Ridiculous. You could say something more BS, like you don't need a database.

@puneetsharma1437 10 months ago

thanks for tips

@bbbbburton 10 months ago

I think some of those Stack Overflow people mentioned have shown up in the comments.

@RockTheCage55 10 months ago

Before I finished watching the solution: `exists` or `in` would work here.

@tradingisthinking 7 months ago

these are being translated to the same thing in the background

@7th_CAV_Trooper 10 months ago

Join and sub query are the same thing. You can view the query plan to see this. Pro tip, it's the SQL server's job to optimize the query.

@PlanetScale 10 months ago

This is in MySQL, I don't know anything about SQL server!

@lucass8119 10 months ago

Join and Subquery are not the same thing, they perform different functions. If they were the same thing we wouldn't be seeing all this advice to avoid subqueries - because they could just be translated to joins. They're different operations.