RAND is too slow (in MySQL)

Подписаться 38 тыс.

Просмотров 14 тыс.

50% 1

👉 Watch our video on 'Covering Indexes' here: • Faster than a regular ...
To learn more about PlanetScale, head to planetscale.com/youtube!
💬 Follow PlanetScale on social media
• Twitter: / planetscale
• Discord: / discord
• TikTok: / planetscale
• Twitch: / planetscale
• LinkedIn: / planetscale

Наука

Опубликовано:

16 июл 2024

Ссылка:

Скачать:

Готовим ссылку...

Добавить в:

Мой плейлист

Посмотреть позже

Комментарии : 45

@scottp131 7 месяцев назад

hey man your videos kick ass and i cannot thank you enough for your approach with these. your videos can be watched once and understood... every single one of them... i don't know how you do it, but the way you have picked to teach anything you teach is incredible. you freaking rock! thank you!

@PlanetScale 7 месяцев назад

Gah that really encourages me. Thank you

@Wundero 7 месяцев назад

One other option for the "hash" variant is to convert the hash into a numeric column (take the bottom half or quarter of md5 (a 128 bit hash), for example) and then modulo that by some arbitrary number. The modulo would help produce groups that you could then filter by and randomize the order of. Ideally the number is determined by your data size, since a small number gives larger groups, so if you have like 1000 rows, % 3 or % 7 might work well, while on millions of rows, something % 17737 might work better. The concept is similar, in theory, to a hash table, where you would take the hash % the size of the table to figure out what group to put them in. This also lets you "seed" some of the randomness, especially if you do the remaining randomness in your application, which can let you have randomized pages of data, and you can just change the modulo if you want to change the "seed".

@olivierm.594 7 месяцев назад

On your first example, you may have collisions (same id appears twice or more), i think a simple "distinct" could avoid that (if the list of ids you're parsing is small enough)

@samho321 7 месяцев назад

in my case, primary key is uuid, i can randomly pick 4 char inside 32 length from primary key and sort it and limit it, also i able to make those 4 random positions in to a random seed for pagination

7 месяцев назад

Another way: Create a generated decimal column, with default value is rand() function. Add index to them and use-it.

@PlanetScale 7 месяцев назад

Good idea! That would lead to rows being really "clumpy" though. Rows with the same or similar random numbers would be returned together. Also a generated column can't have a RAND default unfortunately

@IsaacWalkercox 7 месяцев назад

@@PlanetScale why can't Rand() be the default for a generated column? Too expensive computationally?

@devhammed 7 месяцев назад

@@IsaacWalkercox Because the function is not deterministic, A function is deterministic if, given the same data in tables, multiple invocations produce the same result, independently of the connected user. This is generally in place for features like master-slave database replication.

@IsaacWalkercox 7 месяцев назад

@@devhammed makes sense. The index would need the same information per user. Seems like something they could get round with some sort of wider context though

@RandomGeometryDashStuff 7 месяцев назад

does this make result always same?

@gfni 7 месяцев назад

Great video as always! 👏

@vasiovasio 7 месяцев назад

Super Useful Video! Thank you, Aaron!

@jacksonzamorano 7 месяцев назад

Wow! I never knew about covering indexes. I don't need to do a random pull like in the video but I could definitely put those indexes to use. Great video, keep it up!

@PlanetScale 7 месяцев назад

Check out the video right before this I've, it's all about covering indexes! The one with the timer in the thumbnail

@cafehodge 7 месяцев назад

for gaps in auto incremented ids or uuids you could create a row_number with a window function to use the rand() function against and guarantee results will be returned. but idk how computationally expensive that’d be.

@___dyego 5 месяцев назад

You're an amazing teacher!

@tamasnogradi4524 7 месяцев назад

This was really interesting mate! Love your videos, you filled a huge gap in related db topic for a lot of people.

@PlanetScale 7 месяцев назад

Thank you 😌

@IvanMontilla 6 месяцев назад

I'm pretty sure someone will implement a random password generator this way, by storing all possible passwords in a table and then letting MySQL do its magic.

@ahmad-murery 7 месяцев назад

I liked the bucket method, Thanks Aaron!

@sungvadan 7 месяцев назад

Thank you so much for the video. I have an application that I use to learn vocabulary. It sends me ten random words by email every day. I'm using the random() function, which is fine because the vocabulary table is small. However, I've always wondered if there are better ways to do this. Thanks to your video, I now know some alternative methods.

@roganl 7 месяцев назад

Excellent Video. Picking up where you left off, using the hash column index to expedite the lookup, how would generating rand (n) hex values improve randomness (16^4 space) over just using two hex buckets ? Would it make sense to generate multiple two digit hash columns and use them both?

@AlexCernat 7 месяцев назад

yeah, deferred joins and covering indexes rules! great idea using hashes for "randomness"!

@DuniC0 7 месяцев назад

I love videos that get me into start thinking. What if you strip down the size of the data several orders of magnitude with something like WHERE RAND() < 0.001 And then just ORDER BY RAND() only for the subset?

@daniel_petrica 7 месяцев назад

I got a similar problem with my plantescale free db where my bad code hanered the rows read the first month with the order by rand(). I ended up doing a less random solution using cache and code. In php I generate a random int inside the count of table rows (so i dont care about id gaps), and i get n rows from the db with that int as an offset. Then I randomly order them in php and return an array. As I increase the amount of rows extracted, I'm caching that query in Laravel for a x amount of time and on each request I pick from the cache as with a big enough amount of rows the limited randomnes is good enough in my application. To reduce the time in the wire, I only select the columns I need for the use case, and I have an index on the columns needed. The cache can be generated by a worker to avoid having a slow request every x second to the web user.

@Kane0123 7 месяцев назад

If its fast enough, its fast enough... relevant in so many places...

@mamad-dev 7 месяцев назад

i have a question, what is the application that he's using ?

@hannielgutierrez9139 7 месяцев назад

If i try to make a virtual field an index, then Maria DB tells me this: 1901 - Function or expression 'concat(`fecha`,' ',`hora`)' cannot be used in the GENERATED ALWAYS AS clause of `fecha_hora` Does it works just in MySQL?

@kriansa 7 месяцев назад

Would that inner join at 13:14 be effectively replaceable by a where exists keeping the same performance?

@mme725 7 месяцев назад

I'm wondering this as well.

@balduin_b4334 7 месяцев назад

If I had to make a guess, I would say that it is less performant, the inner join is a subquery, so it is executed first, and then it is getting joined on the primary key of the bookmarks table. "where exists " I would think is getting looked up (or even executed) for each row. (not an expert btw)

@OliverONeill 7 месяцев назад

He did explain this around 6:00. RAND() is a volatile function, so it being in the WHERE clause calls RAND() over and over again for each row.

@tylerslater 5 месяцев назад

What client GUI are you using?

@troublescoot 5 месяцев назад

Looks like he's using TablePlus

@Voltra_ 7 месяцев назад

Reservoir Sampling. 'nough said

@PlanetScale 7 месяцев назад

Perhaps not quite enough, I've never heard of Reservoir Sampling! Will have to look that up.

@Voltra_ 7 месяцев назад

@@PlanetScale here's the cppcon talk I've learned about it from: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-YA-nB2wjVcI.htmlsi=H40wG1Iha9yOygHZ (I resisted the urge to make it a Rick roll)

@youtubehotdogdogyoutube 7 месяцев назад

Yeah, maybe in your SQL

@RayHorn5128088056 7 месяцев назад

Redis the rows and choose randomly from cache. Problem solved.

@PlanetScale 7 месяцев назад

Redis the rows? I'm not sure what that means

@RayHorn5128088056 7 месяцев назад

@@PlanetScale Cache = Performance. Kudos on doing all that work to get your database to cough up the data but caching is always faster.

@PlanetScale 7 месяцев назад

I understand that 🤗 But are you saying store all the rows in Redis? "Redis the rows" is unclear to me

@RayHorn5128088056 7 месяцев назад

@@PlanetScale I am always saying cache just enough data to speed up the process. This should always be the goal. Trust me, there is an optimal solution involving a cache.