Hey Jordan, I recently joined Google as an SSE, and I wanted to express my sincere gratitude for your system design videos, especially the ones comparing multiple solutions. Those comparisons were exactly what the interviewers were looking for, based on my interview feedback.
At 19:00, what do we mean by "add as an index entry"? Do we keep vector1 as the index and v2:v3:v4 (its nearby vectors) as a column, or is v2:v3:v4 itself the index entry? (I know little about vector DBs, but I'm trying to understand whether each vector is represented as a geohash and can be indexed on its own.)
Is my understanding correct that there will be as many Bloom filters in the recommendation service as there are users that connect to it? Secondly, as I keep watching more and more videos, my Bloom filter would quickly fill up within days or months. How does our system deal with a Bloom filter whose slots are all set because of the plethora of videos I might have seen over that time?
We can have as many or as few as we want since they're just an approximation. We'd have to experiment in practice. Eventually, you just clear it, and let it get filled back up again :)
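Roughly, a Bloom filter over watched videos could look like this (a toy sketch; the sizing, hashing, and reset threshold are all just my assumptions to make it concrete):

```python
import hashlib

class SeenVideosBloomFilter:
    """Approximate set of videos a user has already watched.

    False positives are fine (we just skip recommending a video the
    user may not have actually seen); false negatives never occur
    until the filter is reset.
    """

    def __init__(self, num_bits=1 << 20, num_hashes=5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)
        self.items_added = 0

    def _positions(self, video_id: str):
        # Derive k bit positions from k salted hashes of the video id.
        for i in range(self.num_hashes):
            h = hashlib.sha256(f"{i}:{video_id}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.num_bits

    def add(self, video_id: str):
        for pos in self._positions(video_id):
            self.bits[pos // 8] |= 1 << (pos % 8)
        self.items_added += 1
        # Once the filter saturates, clear it and let it refill;
        # recently watched videos may briefly be re-recommended.
        if self.items_added > self.num_bits // (self.num_hashes * 2):
            self.bits = bytearray(self.num_bits // 8)
            self.items_added = 0

    def probably_seen(self, video_id: str) -> bool:
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(video_id)
        )
```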
It's pretty nice for column compression, if we can get any, and I do believe the files should be immutable once written. Do you have a different proposal?
Great video! I did try to digest and understand what you're talking about :) Still, I've got one question: why won't sharding the vector database by the vector hash result in a hot partitioning problem, the same way sharding the neighbor index by the vector hash will?
I think that in theory it could, but most of the additions to the vector database are done asynchronously in the background, so we have more flexibility to temporarily stop all writes and rebalance as needed. We'd want to shard in a similar fashion though, where vectors in close proximity are near one another.
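For illustration, one way to get that locality-preserving sharding (a toy sketch using a random-hyperplane hash as a stand-in for whatever the real system uses; all parameters here are made up):

```python
# Toy locality-preserving shard assignment: vectors whose hashes share
# a prefix land on the same partition, so neighbors tend to colocate.
import random

NUM_PLANES = 12     # bits in the locality-sensitive hash
PREFIX_BITS = 4     # shard on the leading bits -> 16 partitions
DIM = 8

random.seed(42)
planes = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_PLANES)]

def lsh(vector):
    """Random-hyperplane hash: nearby vectors get similar bit strings."""
    bits = 0
    for plane in planes:
        dot = sum(p * v for p, v in zip(plane, vector))
        bits = (bits << 1) | (1 if dot >= 0 else 0)
    return bits

def shard_for(vector):
    # Using the leading bits keeps close vectors on the same shard,
    # at the cost of potential hot shards for popular regions.
    return lsh(vector) >> (NUM_PLANES - PREFIX_BITS)
```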
Great video! I see that you used a heap for new entries into the closest-neighbor index. Isn't insertion time into a heap the same O(log n) as it would be in a DB index that uses B+ trees? I do understand that in the index we might need to replace multiple rows, whereas with a heap that won't happen. Is that the optimization here? Trying to understand how this speeds things up.
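For context, if the heap is bounded to the k closest neighbors, insertion is O(log k) with a small fixed k rather than O(log n) over all rows; whether that's the intended optimization is my guess, not confirmed in the video. A minimal sketch:

```python
import heapq

def update_nearest(neighbors, k, candidate_id, distance):
    """Keep only the k closest neighbors in a bounded max-heap.

    `neighbors` holds (-distance, id) so the farthest kept neighbor
    sits at the root and is evicted first. Each insertion costs
    O(log k), independent of the total number of vectors n.
    """
    if len(neighbors) < k:
        heapq.heappush(neighbors, (-distance, candidate_id))
    elif -neighbors[0][0] > distance:
        heapq.heapreplace(neighbors, (-distance, candidate_id))

heap = []
for vid, dist in [("v2", 0.3), ("v3", 0.1), ("v4", 0.7), ("v5", 0.2)]:
    update_nearest(heap, k=3, candidate_id=vid, distance=dist)
# heap now holds the 3 closest candidates: v3, v5, v2
```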
Great video Jordan! Learned a lot from this video. One question on the Recommendation Service -> Neighbor index flow at 41:18. Since we are sharding the neighbor index by entity_id, the recommendation service, in case of a cache miss, has to scatter and gather, right? Entities 12, 13, and 62 (the examples in the slide) could be in different partitions.
They would have to fetch the neighbors for their last x watched videos. So for each of those x videos, all of its neighbors will be on the same node, but otherwise we may have to hit up to x different partitions.
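In pseudocode-ish terms, the cache-miss path might look like this (a sketch; the partition count, RPC stub, and function names are all hypothetical):

```python
import asyncio

NUM_PARTITIONS = 8

def partition_for(entity_id: int) -> int:
    # The neighbor index is sharded by entity_id, so one entity's full
    # neighbor list always lives on a single partition.
    return entity_id % NUM_PARTITIONS

async def fetch_neighbors(partition: int, entity_ids: list[int]) -> dict:
    # Stand-in for an RPC to one neighbor-index partition.
    await asyncio.sleep(0)
    return {eid: [f"neighbor-of-{eid}"] for eid in entity_ids}

async def neighbors_for_watch_history(watched: list[int]) -> dict:
    # Group the user's last x watched videos by partition, then
    # scatter one request per partition and gather the results.
    by_partition: dict[int, list[int]] = {}
    for eid in watched:
        by_partition.setdefault(partition_for(eid), []).append(eid)
    results = await asyncio.gather(
        *(fetch_neighbors(p, ids) for p, ids in by_partition.items())
    )
    merged = {}
    for r in results:
        merged.update(r)
    return merged

print(asyncio.run(neighbors_for_watch_history([12, 13, 62])))
```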
Hey! I'd probably start reading some white papers! As for which ones, there are like 10-20 tools on LinkedIn who only post links to other people's content on their pages, hopefully one of them is decent
Great video Jordan! Can you do one for an ACID-based system like a digital wallet or a bank? Or a combination of both, like bank-to-wallet and wallet-to-wallet maybe?
Where do you see the challenge here? At least to me, this initially just feels like you'll need ACID databases, or two-phase commit when making a transaction between two accounts on different partitions.
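For instance, a bare-bones two-phase commit between two account partitions might look like this (a sketch under my own assumptions; a real implementation needs durable logs, locks, timeouts, and recovery):

```python
class AccountPartition:
    """One shard holding some accounts; a participant in 2PC."""

    def __init__(self, balances):
        self.balances = dict(balances)
        self.pending = {}  # txn_id -> (account, delta)

    def prepare(self, txn_id, account, delta):
        # Phase 1: vote yes only if the debit can't overdraw.
        if self.balances[account] + delta < 0:
            return False
        self.pending[txn_id] = (account, delta)
        return True

    def commit(self, txn_id):
        account, delta = self.pending.pop(txn_id)
        self.balances[account] += delta

    def abort(self, txn_id):
        self.pending.pop(txn_id, None)

def transfer(txn_id, src_part, src, dst_part, dst, amount):
    # Phase 1: both partitions must vote yes before any money moves.
    if src_part.prepare(txn_id, src, -amount) and dst_part.prepare(txn_id, dst, amount):
        # Phase 2: commit on both sides.
        src_part.commit(txn_id)
        dst_part.commit(txn_id)
        return True
    src_part.abort(txn_id)
    dst_part.abort(txn_id)
    return False

a = AccountPartition({"alice": 100})
b = AccountPartition({"bob": 20})
assert transfer("t1", a, "alice", b, "bob", 60)
assert not transfer("t2", a, "alice", b, "bob", 60)  # would overdraw
```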
Hey Jordan, great content. Thank you for making these videos in depth. QQ: do you think we can use a graph database such as Neo4j instead of the neighbor index for faster reads?
I think that you could, but consider this: for every vector (which is just an arbitrary point in a high-dimensional space), you'd need to create edges to other vectors so that you can traverse the graph. How do you decide which ones to connect? Even then, let's imagine you could: you'd still have to run a breadth-first search to find the closest vectors. I'd think that pre-caching your answers here will just about always be the fastest option.
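To illustrate the read-path difference (a toy example with made-up data):

```python
from collections import deque

# Graph-DB style: edges between vectors, answered by BFS at read time.
edges = {"v1": ["v2", "v3"], "v2": ["v4"], "v3": [], "v4": []}

def bfs_neighbors(start, limit):
    # Traverses hops outward; cost grows with the edges explored.
    seen, queue, out = {start}, deque([start]), []
    while queue and len(out) < limit:
        for nxt in edges.get(queue.popleft(), []):
            if nxt not in seen:
                seen.add(nxt)
                out.append(nxt)
                queue.append(nxt)
    return out

# Precomputed neighbor index: the answer is materialized up front,
# so a read is a single O(1) lookup.
neighbor_index = {"v1": ["v2", "v3", "v4"]}

assert bfs_neighbors("v1", 3) == neighbor_index["v1"]
```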
@@jordanhasnolife5163 Hi, I don't mean to rush you, but I have some important interviews in the coming weeks, and having your notes would really help me prep better. Can you share them in any form? I understand there may be mistakes or typos in them, but I want to be able to quickly revise all the overarching concepts and designs.
@@vipulspartacus7771 Hi Vipul - I understand your rush here; it will take me a few hours to properly export everything, which is the reason for the delay. I haven't sat down and done it yet. Additionally, once I do, I'd like to publicize it a bit, since, if we're being fully transparent, I hope the notes can help me build my following. My original slides contain all of the same information.
Does this design account for the popularity/trendiness of a given entity? For example, if a random video from an unknown creator suddenly becomes extremely popular (happens a lot on TikTok), it should be recommended, whereas an hour earlier it was unpopular and irrelevant and thus should not have been recommended.
It does not, and good point! I think for something like this you'd want to see the Top K video, and basically keep a cache of which videos are "trending" in the last x hours to apply a score boost.
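Something like this, perhaps (a hypothetical sketch; the window size and boost formula are my own invention):

```python
import time
from collections import Counter, deque

class TrendingBoost:
    """Sliding-window view counter used to boost 'suddenly hot' videos."""

    def __init__(self, window_seconds=3600):
        self.window = window_seconds
        self.events = deque()   # (timestamp, video_id)
        self.counts = Counter() # views within the window

    def record_view(self, video_id, now=None):
        now = now if now is not None else time.time()
        self.events.append((now, video_id))
        self.counts[video_id] += 1
        # Evict views older than the window.
        while self.events and self.events[0][0] < now - self.window:
            _, old = self.events.popleft()
            self.counts[old] -= 1

    def boost(self, video_id, base_score):
        # Videos popular in the last hour get a multiplicative bump.
        views = self.counts[video_id]
        return base_score * (1 + views / 1000)
```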