Amazing explanation!!! I was losing my mind trying to visualize how the various sort keys in Redshift work and you made it soooo simple. Appreciate your help!
While single and compound keys are no-brainers, I had to go to multiple websites to learn about the interleaved sort key, but they offered no clear explanation. You ended my quest here! Very nicely done.
Kasi, thank you. I, unfortunately, had to hear about 20 explanations myself before I went, "Ooooh, that's how it works." I love helping intelligent people who already have great experience take the next step towards "Total Guru". You are almost there!!! Great job.
Simple, to the point, and very clear. A quarter of the video is just a review of the topic discussed (good practice, but it's a video, so it can be replayed). GREAT VIDEO!!!
Nicely explained, and easy to understand for a newbie like me. Also, if possible, can you share any system table where we can see the exact blocks (block1..blockN) you have shown in your video, so that we get more familiar with this while we practice? Thanks again for this video.
Good one. I have a question here. Of the two options below, which is more efficient: accessing one block by filtering on the distkey, or having even distribution and accessing multiple blocks? Some people claim that accessing many blocks allows parallel processing, which means better utilization of all the nodes.
Preetham, there are two philosophies for querying big data well. You want to use the Distkey when going after a single record (row). This only accesses one block and a single slice. When you need to analyze many rows (thousands to millions), you want the parallel processes to each do an equal amount of the work. You want the Distkey for one or two rows, and you want parallel processing for large queries.
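To make the two philosophies above concrete, here is a minimal toy sketch in Python. It is not Redshift code: the 4 slices, the `crc32`-based hash, and the helper names (`distribute_by_key`, `distribute_even`, `slices_touched`) are all assumptions made up for illustration, and Redshift's real hashing and storage are more involved.

```python
# Toy model of two distribution styles across 4 slices.
# Assumption: crc32 stands in for Redshift's real (different) hash function.
import zlib
from collections import defaultdict

NUM_SLICES = 4

def slice_for_key(cust_id):
    """Pick a slice from a deterministic hash of the distribution key."""
    return zlib.crc32(str(cust_id).encode()) % NUM_SLICES

def distribute_by_key(rows):
    """KEY-style placement: all rows with the same cust_id share one slice."""
    slices = defaultdict(list)
    for row in rows:
        slices[slice_for_key(row["cust_id"])].append(row)
    return slices

def distribute_even(rows):
    """EVEN-style placement: rows are dealt out round-robin."""
    slices = defaultdict(list)
    for i, row in enumerate(rows):
        slices[i % NUM_SLICES].append(row)
    return slices

def slices_touched(slices, cust_id):
    """Return the slices that hold at least one row for this customer."""
    return sorted(
        s for s, rows in slices.items()
        if any(r["cust_id"] == cust_id for r in rows)
    )

# 8 customers with 5 orders each, 40 rows in total.
rows = [{"cust_id": c, "order_no": n} for c in range(1, 9) for n in range(5)]
by_key = distribute_by_key(rows)
even = distribute_even(rows)

# Looking up one customer: one slice under KEY, every slice under EVEN.
print(len(slices_touched(by_key, 3)))   # 1
print(len(slices_touched(even, 3)))     # 4
# A full scan under EVEN gives each slice the same amount of work.
print([len(even[s]) for s in range(NUM_SLICES)])  # [10, 10, 10, 10]
```

In this toy model a single-customer lookup touches one slice when the rows are placed by key, while round-robin placement gives every slice an equal share of a large scan.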
Thanks. Can you make a video on whether the sort key column should be compressed or not? If no, why? If yes, what are the benefits and the recommended compression technique?
I feel that you are kinda skipping over slices here a bit. As far as I know, with a distkey on Cust_ID in this case, each slice will contain only 1 cust_id, and since each slice has one compute node, you would only have to read one block per slice. So you would be reading 4 blocks, yes, but from 4 different slices, using 4 different nodes, and thus it would perform as fast as a single block read, right? Or am I wrong here? I would be very interested in knowing how interleaved sortkeys work with an even/key distribution style.
All the explanation of the sort keys is simply superb. I am stuck on understanding the VACUUM REINDEX that we do to sort the rows. Can you help me out with this?
Surabi, thanks for your nice comments. Superb is good!!! What has been a difficult concept for me is that you can sort or index a table, and it sorts and indexes. But if you load more data tomorrow and the next day and so on, the data is no longer sorted (interleaved sort especially). A reindex refers to an interleaved sort key. Here is what Amazon says... "When you initially load an empty interleaved table using COPY or CREATE TABLE AS, Amazon Redshift automatically builds the interleaved index. If you initially load an interleaved table using INSERT, you need to run VACUUM REINDEX afterwards to initialize the interleaved index."
@Coffingdw Thanks for your timely response. What I don't understand is the value of interleaved_skew that we use to decide whether to reindex a table or not. After multiple tests, I still have no clue what factors interleaved_skew depends on. Please help me with that.
Surabi, imagine you have 9 months of data and the month is September. You use an interleaved sort key on your table (9 months of data), and your data has two sort keys with equal weight. The queries work great for the table because whichever key they query on, they get great result speeds. Now, imagine it has been one year. The last three months have simply been appended to the end of the table. They are not sorted with the previous 9 months. When you do a VACUUM REINDEX, the full year of data is sorted properly again.
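A rough way to picture the scenario above is a toy Python model. It is heavily simplified: it uses a single sort column instead of a real interleaved key, fixed 100-row blocks, and a per-block min/max range used to skip blocks. The names `make_blocks` and `blocks_scanned` and all the numbers are hypothetical, and re-sorting the whole list only stands in for what a VACUUM REINDEX achieves.

```python
# Toy model: sorted blocks let a filter skip most blocks; rows appended in
# arrival order do not, until everything is re-sorted.
import random

BLOCK_SIZE = 100

def make_blocks(values):
    """Split values into fixed-size blocks, each with a min/max range."""
    blocks = []
    for i in range(0, len(values), BLOCK_SIZE):
        chunk = values[i:i + BLOCK_SIZE]
        blocks.append({"min": min(chunk), "max": max(chunk), "rows": chunk})
    return blocks

def blocks_scanned(blocks, target):
    """Count blocks whose min/max range could contain the target value."""
    return sum(1 for b in blocks if b["min"] <= target <= b["max"])

rng = random.Random(0)
# First 9 "months": loaded and sorted together.
initial = sorted(rng.randrange(1, 10_001) for _ in range(900))
# Last 3 "months": appended to the end in arrival order, not sorted.
appended = [rng.randrange(1, 10_001) for _ in range(300)]

before = make_blocks(initial) + make_blocks(appended)   # sorted + unsorted tail
after = make_blocks(sorted(initial + appended))         # everything re-sorted

target = 5000
print("blocks scanned before re-sort:", blocks_scanned(before, target))
print("blocks scanned after re-sort: ", blocks_scanned(after, target))
```

In this sketch the appended blocks each span almost the whole value range, so a filter has to read all of them; once the full table is sorted again, only the block or two around the target value qualifies.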
@Coffingdw I get that completely. I am completely with you that we need to do a VACUUM REINDEX to reorder the data once in a while. But my question is about the term interleaved_skew and what factors it depends on (the reasons it changes). You can find interleaved_skew in svv_interleaved_columns.
The comment about slices (01:51) and distribution keys almost tripped me up. I feel that is a little unnecessary. Apart from that, it is a nice explanation. IMO, in a discussion of SORT keys, slices and distribution keys should not play a significant role.